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A 


A  View  of  Unconstrained  Optimization 


Abstract 


Finding  the  unconstrained  minimizer  of  a  function  of  more  than  one  variable  is  an  important  problem 
with  many  practical  applications,  including  data  fitting,  engineering  design,  and  process  control.  In  addi¬ 
tion,  techniques  for  solving  unconstrained  optimization  problems  form  the  basis  for  most  methods  for  solv¬ 
ing  constrained  optimization  problems.  This  paper  surveys  the  state  of  the  art  for  solving  unconstrained 


optimization  problems  and  the  closely  related  problem  of  solving  systems  of  nonlinear  equations.  First  W  r 

♦•Key 

briefly  give  some  mathematical  background.  Then  we  discuss  Newton’s  method,  the  fundamental  method 
underlying  most  approaches  to  these  problems,  as  well  as  the  inexact  Newton  method.  The  two  main  prac- 


deal  deficiencies  of  Newton’s  method,  the  need  for  analytic  derivatives  and  the  possible  failure  to  converge 
to  the  solution  from  poor  starting  points,  are  the  key  issues  in  unconstrained  optimization  and  are  addressed 
nextNiUe^dtsevcr  a  variety  of  techniques  for  approximating  derivatives,  including  finite  difference  approxi¬ 
mations,  secant  methods  for  nonlinear  equations  and  unconstrained  optimization,  and  the  extension  of  these 
techniques  to  solving  large,  sparse  problems.  Then  w»disctusThe  main  methods  used  to  ensure  conver¬ 


\  r  J 


gence  from  poor  starting  points,  line  search  methods  and  trust  region  methods.  Next  wa  briefly-discass  two 
rather  different  approaches  to  unconstrained*  optimization,  the  Nelder-Meade  simplex  method  and  conju¬ 
gate  direction  methods.  Finally  we  comment  on  some  current  research  directions  in  the  field,  in  the  solu- 

\ 

lion  of  large  problems,  thesolution  of  data  fitting  problems,  new  secant  methods,  the  solution  of  singular 
problems,  and  the  use  of  parallel  computers  in  unconstrained  optimization. 


1.  Preliminaries 


1.1  Introduction 

This  chapter  discusses  the  most  basic  nonlinear  optimization  problem  in  continuous  variables,  the 
unconstrained  optimization  problem.  This  is  the  problem  of  finding  the  minimizing  point  of  a  nonlinear 
function  of  n  real  variables,  and  it  will  be  denoted 

minimize  /  :  IR"  -*R  .  (j  j) 

It  will  be  assumed  that  /  (x)  is  at  least  twice  continuously  differentiable. 

The  basic  methods  for  unconstrained  optimization  are  most  easily  understood  through  their  relation 
to  the  basic  methods  for  a  second  nonlinear  algebraic  problem,  the  nonlinear  equations  problem.  This  is 
the  problem  of  finding  the  simultaneous  solution  of  n  nonlinear  equations  in  n  unknowns,  denoted 

given  F  :  1R"  -»1R*  ,  find  x.  for  which  F(x.)  =  0  (1.2) 

where  F(x)  is  assumed  to  be  at  least  once  continuously  differentiable.  Therefore,  this  chapter  also  will  dis¬ 
cuss  the  nonlinear  equations  problem  as  necessary  for  understanding  unconstrained  optimization. 

Unconstrained  optimization  problems  arise  in  virtually  all  areas  of  science  and  engineering,  and  in 
many  areas  of  the  social  sciences.  In  our  experience,  a  significant  percentage  of  real-world  unconstrained 
optimization  problems  are  data  fitting  problems  (see  Section  6.2).  The  size  of  real-world  unconstrained 
optimization  problems  is  widely  distributed,  varying  from  small  problems,  say  with  n  between  2  and  10,  to 
large  problems,  say  with  n  in  the  hundreds  or  thousands.  In  many  cases,  the  objective  function  /(x)  is  a 
computer  routine  that  is  expensive  to  evaluate,  so  that  even  small  problems  often  are  expensive  and 
difficult  to  solve. 

The  user  of  an  unconstrained  optimization  method  is  expected  to  provide  the  function  /(x)  and  a 
starting  guess  to  the  solution,  xo-  The  routine  is  expected  to  return  an  estimate  of  a  local  minimizcr  x.  of 
/(x),  the  lowest  point  in  some  open  subregion  of  IR".  The  user  optionally  may  provide  routines  for 
evaluating  the  first  and  second  partial  derivatives  of  /  (x),  but  in  most  cases  they  are  not  provided  and 
instead  are  approximated  in  various  ways  by  the  algorithm.  Approximating  these  derivatives  is  one  of  the 
main  challenges  of  creating  unconstrained  optimization  methods.  The  other  main  challenge  is  to  create 


methods  that  will  converge  to  a  local  minimizer  even  if  xo  is  far  from  any  minimum  point.  This  is  referred 
to  as  the  globed  phase  of  the  method.  The  part  of  the  method  that  converges  quickly  to  x» ,  once  it  is  close 
to  it,  is  referred  to  as  the  local  phase  of  the  method. 

This  emphasis  of  this  chapter  is  on  modem,  efficient  methods  for  solving  unconstrained  optimization 
problems,  with  particular  attention  given  to  the  main  areas  of  difficulty  mentioned  above.  Since  function 
evaluation  so  often  is  expensive,  the  primary  measure  of  efficiency  often  is  the  number  of  function  (and 
derivative)  evaluations  required.  For  problems  with  large  numbers  of  variables,  the  number  of  arithmetic 
operations  required  by  the  method  itself  (aside  from  function  evaluations)  and  the  storage  requirements  of 
the  method  become  increasingly  important. 

The  remainder  of  this  section  reviews  some  basic  mathematics  underlying  this  area  and  the  rest  of 
continuous  optimization.  Section  2  discusses  the  basic  local  method  for  unconstrained  optimization  and 
nonlinear  equations,  Newton’s  method.  Section  3  discusses  various  approaches  to  approximating  deriva¬ 
tives  when  they  aren’t  provided  by  die  user.  These  include  finite  difference  approximations  of  the  deriva¬ 
tives,  and  secant  approximations,  less  accurate  but  less  expensive  approximations  that  have  proven  to  lead 
to  more  efficient  algorithms  far  problems  where  function  evaluation  is  expensive.  We  concentrate  on  the 
most  successful  secant  method  for  unconstrained  optimization,  the  “BFGS”  method,  and  as  motivation  we 
also  cover  the  most  successful  secant  method  for  nonlinear  equations,  Broy den's  method.  Sections  2  and  3 
cover  both  small  and  large  dimension  problems,  although  the  solution  of  small  problems  is  better  under¬ 
stood  and  therefore  is  discussed  in  more  detail.  Methods  that  are  used  when  starting  far  from  the  solution, 
called  global  methods,  are  covered  in  Section  4.  The  two  main  approaches,  line  search  methods  and  trust 
region  methods,  both  are  covered  in  some  detail.  In  Section  S  we  cover  two  important  methods  that  do  not 
fit  conveniently  in  the  Taylor  series  approach  that  underlies  Sections  2  through  4.  These  are  the  Nelder- 
Mead  simplex  method,  an  important  method  for  solving  problems  in  very  few  variables,  and  conjugate  gra¬ 
dient  methods,  one  of  the  leading  approaches  to  problems  with  very  many  variables.  Section  6  briefly  sur¬ 
veys  a  number  of  current  research  directions  in  unconstrained  optimization.  These  include  further 
approaches  to  solving  problems  in  many  variables,  special  methods  for  problems  arising  from  data  fitting, 
further  issues  in  secant  approximation  of  derivatives,  the  solution  of  problems  where  the  Jacobian  or  Hes¬ 
sian  matrix  at  the  solution  is  singular,  and  the  use  of  parallel  computers  in  solving  unconstrained  optimiza- 


tion  problems. 

The  scope  of  this  chapter  is  limited  in  many  important  ways  which  are  then  addressed  by  other 
chapters  of  this  book.  We  do  not  consider  constrained  optimization  problems,  problems  where  the  domain 
of  permissible  solutions  is  limited  to  those  variables  x  satisfying  one  or  more  equality  or  inequality  con¬ 
straints.  Constrained  optimization  problems  in  continuous  variables  are  discussed  in  Chapter  3;  the  tech¬ 
niques  for  solving  these  problems  draw  heavily  upon  the  techniques  for  solving  unconstrained  problems. 
We  do  not  consider  problems  where  f(x)  is  not  differentiable;  these  are  discussed  in  Chapter  9.  We  do  not 
discuss  methods  that  attempt  to  find  the  global  minimizer,  the  lowest  of  the  possibly  multiple  local  minim- 
izers  of  /  (x).  Methods  for  finding  the  global  minimizer,  called  gbbal  optimization  methods,  are  discussed 
in  Chapter  11.  Finally,  we  do  not  consider  problems  where  some  or  all  of  the  variables  are  restricted  to  a 
discrete  set  of  values,  for  example  the  integers;  these  problems  must  be  solved  by  techniques  such  as  those 
discussed  in  Chapters  2,  4,  S,  and  6.  It  should  be  noted  that  in  most  nonlinear  optimization  problems 
solved  today,  the  variables  are  continuous,  fix)  is  differentiable,  and  a  local  minimizer  provides  a  satisfac¬ 
tory  solution.  This  probably  reflects  available  software  as  well  as  the  needs  of  practical  applications. 

A  number  of  books  give  substantial  attention  to  unconstrained  optimization  and  are  recommended  to 
readers  who  desire  additional  information  on  this  topic.  These  include  Ortega  and  Rheinboldt  [1970], 
Fletcher  [1980],  Gill.  Murray,  and  Wright  [1981],  and  Dennis  and  Schnabel  [1983]. 

1.2  Taylor  Series  Models 

Most  methods  for  optimizing  nonlinear  differentiable  functions  of  continuous  variables  rely  heavily 
upon  Taylor  series  expansions  of  these  functions.  This  section  briefly  reviews  the  Taylor  series  expansions 
used  in  unconstrained  (and  constrained)  optimization  and  in  solving  systems  of  nonlinear  equations,  and  a 
few  mathematical  properties  of  these  expansions.  The  material  of  this  section  is  developed  fully  in  Ortega 
and  Rheinboldt  [1970]  or  Dennis  and  Schnabel  [1983]. 

The  fundamental  Taylor  series  approximation  used  in  solving  a  system  of  nonlinear  equations  (1.2)  is 
the  first  two  terms  of  the  Taylor  series  approximation  to  Fix)  around  a  current  point  xc , 


Here  the  notation  M?  stands  for  the  current  Newton  model  because  we  will  see  in  Section  2  that  local 
methods  for  solving  nonlinear  equations  are  based  on  Taylor  series  models  of  the  form  (1.3),  and  Newton’s 
method  is  based  specifically  on  (1.3).  J(xc)  denotes  the  n  xn  Jacobian  matrix  of  first  partial  derivatives  of 


F (x )  at  xc\  in  general, 


(/(jr)L>  =  .  i-1. --  A,  7=1. •  •  •  A 

UXj 


where  F,(x)  denotes  the  i*  component  function  olF(x). 

The  local  convergence  analysis  of  methods  based  upon  (1.3)  depends  on  the  difference  between 
F(x)  and  M?(x).  This  and  the  next  error  bounds  in  this  subsection,  as  well  as  most  convergence  results  in 
continuous  optimization,  use  the  concept  of  Lipschitz  continuity.  A  function  G  of  the  n  -vector  x  is  said  to 
be  Lipschitz  continuous  with  constant  y  in  an  open  neighborhood  D  clR" ,  written  G  e  LipfD ),  if  for  all 
x.y  eD, 

1 1  IG(x)-G(y)l  1 1  S y  llx  -y  II  (1.5) 

where  II  •  II  and  1 1 1  -  1 1 1  are  appropriate  norms.  Now  if  7(x)e  LipfD )  in  an  open  neighborhood 

D  elR"  containing  x  and  Xe ,  using  a  vector  norm  II  •  II  and  the  associated  matrix  norm,  then  it  is  a  stan¬ 
dard  result  that 


IIA#£,(x)-F(x)ll  £  -Jf-  llx  —Xe  II2.  (1.6) 

This  result  is  similar  to  the  familiar  Taylor  series  with  remainder  results  for  functions  of  one  variable  but 
requires  no  mention  of  higher  derivatives  of  F . 

The  fundamental  Taylor  series  expansion  used  in  unconstrained  optimization  is  the  first  three  terms 
of  the  Taylor  series  of /(x)aroundxe, 

mptec  +d) -/(*)  +  gC*)rd  +  Vi d*H{Xc)d  .  (1.7) 

Here  g(re)  denotes  the  n  component  gradient  column  vector  of  first  partial  derivatives  of  fix)  evaluated 


atXei  in  general 


.  7=1.  -t.  . 


Also,  H(xc)  denotes  the  n  x  n  Hessian  matrix  of  second  partial  derivatives  of  /  (x )  evaluated  atx*;  in  gen- 


Note  that// (z)  is  symmetric  if /(x)  is  twice  continuously  differentiable. 

If  H  (x)  is  Lipschitz  continuous  with  constant  yin  an  open  neighborhood  of  IR"  containing  x*  and  xc , 
using  the  norm  II  •  II ,  then  it  is  a  standard  result  that 

I mj'(x)-/(z)ls£  llx-xc  II3 .  (1.10) 

This  result  is  used  in  convergence  analysis  of  unconstrained  optimization  methods. 

The  standard  Taylor  series  with  remainder  results  from  the  calculus  of  one  variable  also  extend  to 
single  valued  functions  of  multiple  variables.  For  any  direction  d  €  IR*  there  exist  tt,  ti  €  [0,r  ]  for  which 

f(x+d)=f(x)  +  g(x+tld)Td  (1.11) 

and 

f(x+d)=f(x)  +  g(x)Td  +  1AdTH(x+t*l)d  .  (1.12) 

These  results  are  the  keys  to  the  necessary  and  sufficient  conditions  for  unconstrained  minimization  that  we 

consider  next 

1.3  Necessary  and  Sufficient  Conditions  for  Unconstrained  Optimization 

Algorithms  for  solving  the  unconstrained  minimization  problem  are  based  upon  the  first  and  second 
order  conditions  for  a  point  x.  to  be  a  local  minimizer  of  /  (x).  These  conditions  are  briefly  reviewed  in 
this  section. 

For  x*  to  be  a  local  minimizer  of  / (x),  it  is  necessary  that  g  (x* )  =  0,  where  g(x)  denotes  the  gra¬ 
dient  defined  in  (1.8). 

Theorem  1.1  Let  /  (x) :  IR*  -*R  be  continuously  differentiable,  and  let  y  e  IR* .  If  g  (y)*0,  then  y  is  not  a 
local  minimizer  of  /  (x). 

Proof:  If  g(yM),  then  there  exist  directions  d  elR"  for  which  g(y)Td  <  0;  an  example  is  d  ■  -g(y).  For 
any  such  direction  d,  we  have  from  (1.11)  that 

f(y  +  td)-f(y)  =  t-dTg(y+t\d)  (1.13) 

for  some  /j  €  (0,/).  Also  by  the  continuity  of  / (x),  there  exists  8>0  such  that  g  (y  +  t\d)Td  <  0  for  any 


6 


H 


1 1  €  [0,5].  Thus  for  any  stepsize  t  <5, 

f(y  +  td)<f(y).  (1.14) 

Therefore  y  cannot  be  a  local  minimizer  of  /  (x ). 

Directions  d  for  which  g(y)Td  <  0  are  called  descent  directions  for  /  at  y.  Descent  directions  play 
an  important  role  in  die  global  methods  for  unconstrained  optimization  discussed  in  Section  4. 

The  above  argument  with  d  =  g  (x • )  also  shows  that  g (x. )  =  0  is  necessary  for  x.  to  be  a  local  max¬ 
imizer  of  / (x).  To  distinguish  between  minimizers  and  maximizers  it  is  necessary  to  consider  the  second 
derivative  matrix  //(*•)  defined  in  (1.9).  First  we  need  the  definition  of  a  positive  definite  matrix. 

Definition  1.1  Let  H  €  1R"**  be  symmetric.  Then  H  is  positive  definite  if  vTHv  >  0  for  all  nonzero 
v  €  R". 

There  are  several  equivalent  characterizations  of  positive  definite  matrices;  another  common  one  is 
that  a  symmetric  matrix  H  is  positive  definite  if  and  only  if  all  of  its  eigenvalues  are  positive.  If  vTHv  1 0 
for  all  v ,//  is  said  to  be  positive  semi-definite.  A  negative  definite  or  negative  semi-definite  matrix  is  one 
whose  negative  is  positive  definite  or  positive  semi-definite. 

Theorem  1.2  Let  / (x) ;  R“— »/?  be  twice  continuously  differentiable,  and  let  x»  elR".  If  g(x-  )=0  and 
H  (x* )  is  positive  definite,  then  x.  is  a  local  minimizer  of  /  (x ). 

Proof :  By  (1.12)  foranydeR", 

/(x»  +  d)»f  (x«)  +  g(x-)Td  +  '/bdTH(x»  +td)d  (1.15) 

for  some  t  e  (0,1).  By  the  continuity  of  /  (x)  and  the  positive  definiteness  of  H(x» ),  there  exists  5>0  such 

that  for  any  direction  d  with  lid  II  <5  and  any  scalar  t  with  If  I  SI,  H(x»  +  td)  is  positive  definite.  Thus 

for  any  d  with  lid  II  <5,  we  have  from  (1.15)  and  g (x.  )=0  that 

/(x.  -t-d)  >/(x.) .  (1.16) 

Therefore  x«  is  a  local  minimizer  of  /  (x ). 

By  a  similar  argument  it  is  easy  to  show  that  a  necessary  condition  for  x>  to  be  a  local  minimizer  of  a 
twice  continuously  differentiable  /  (x )  is  that  g  (x. )  -  0  and  H  (x. )  is  positive  semi-definite;  in  this  case,  it 


rarvs.vs > .r.mrw/.Viflambv 


is  necessary  to  examine  higher  order  derivatives  to  determine  whether  x.  is  a  local  minimizer.  If  g  (x. )  =  0 
and  //(x.)  is  negative  definite,  then  the  above  argument  shows  that  x*  is  a  local  maximizer  of  /(x).  If 
g(x.)  «  0  and  W(x.)  has  both  positive  and  negative  eigenvalues,  then  x.  is  said  to  be  a  saddle  point  of 
/  (x).  A  saddle  point  is  a  local  minimizer  of  some  cross-section  of  /(x)  and  a  local  maximizer  of  some 
other  cross-section. 

1.4  Rates  of  Convergence 

Most  algorithms  for  nonlinear  optimization  problems  in  continuous  variables  are  iterative.  They 
generate  iterates  x*  €  IR",  k  =  0,1,2,  •  •  •  which  are  meant  to  converge  to  a  solution  x. .  In  this  case,  the 
rate  of  convergence  is  of  interest.  This  subsection  reviews  the  rates  of  convergence  that  are  important  in 
continuous  optimization. 

Definition  1.2  Let  x*  e  IR",  it  =0,1,  -  -  • .  Then  the  sequence  {xk}  is  said  to  converge  to  a  point 
x.  e  IR"  if  for  every  i ,  the  i*  component  (x*  -x. ),  satisfies 

4lim_  (x*-x.)i=0.  (1.17) 

If,  for  some  vector  norm  II  -  1 1,  there  exist  K  £0  and  o  €  [0,1)  such  that  for  all  ktK, 

llx*+i -x.  II  So  llx* -x»  II ,  (1.18) 

then  {xt}  is  said  to  converge  q-linearly  to  x*  in  the  norm  II  -  11.  If  for  some  sequence  of  scalars  {a*}  that 

converge  to  0 

llx*+i-x.  II  So*  llx* -x.  II ,  (1.19) 

then  {xt}  is  said  to  converge  q-superlinearly  to  x. .  If  (xk)  converges  to  x»  and  there  exist  AT  £ 0  and  o£0 

such  that  for  all  kiK, 

llxt+1— X.  II  So  llx»  -X.  II2  (120) 

(Xk)  is  said  to  converge  q-quadratically  to  x« . 

Note  that  a  sequence  may  be  linearly  convergent  in  one  norm  but  not  another,  but  that  superlinear 
and  quadratic  convergence  are  independent  of  the  choice  of  norm  on  IR" . 
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Most  methods  for  continuously  differentiable  optimization  problems  are  locally  q-superlinearly  or 
locally  q-quadratically  convergent,  meaning  that  they  converge  to  the  solution  x>  with  a  superlinear  or 
quadratic  rate  if  they  are  started  sufficiently  close  to  x* .  In  practice,  local  quadratic  convergence  is  quite 
fast  as  it  implies  that  die  number  of  significant  digits  in  x*  as  an  approximation  to  x.  roughly  doubles  at 
each  iteration  once  x*  is  near  x. .  Locally  superlinearly  convergent  methods  that  occur  in  optimization  also 
are  often  quickly  convergent  in  practice.  Linear  convergence,  however,  can  be  quite  slow,  especially  if  the 
constant  a  depends  upon  the  problem  as  it  usually  does.  Thus  linearly  convergent  methods  are  avoided 
wherever  possible  in  optimization  unless  it  is  known  that  a  is  acceptably  small.  It  is  easy  to  define  rates  of 
convergence  higher  than  quadratic,  e.g.  cubic,  but  they  play  virtually  no  role  in  practical  algorithms  for 
multi-variable  optimization  problems. 

The  prefix  q  preceding  the  words  linear,  superlinear,  or  quadratic  stands  for  “quotient”  rates  of  con¬ 
vergence.  This  notation,  commonly  used  in  the  optimization  literature,  is  used  to  contrast  with  r  (“root”) 
rates  of  convergence,  which  are  a  weaker  form  of  convergence.  A  sequence  is  r-linearly  convergent,  for 
example,  if  the  errors  I  lx*  -x.  II  are  bounded  by  a  sequence  of  scalars  {rk}  which  converge  q  -linearly  to 
0.  The  ellipsoidal  algorithm  for  linear  programming  (Khachiyan  [1979])  is  a  perfect  example  of  an  r- 
linear  method  since  the  radii  of  the  ellipses  are  a  sequence  of  error  bounds  that  converge  q  -linearly  to  zero. 
Similar  definitions  apply  to  other  r  -rates  of  convergence;  for  further  detail  see  Ortega  and  Rheinboldt 
[1970].  Thoughout  this  book,  if  no  prefix  precedes  the  words  linear,  superlinear,  or  quadratic,  q  -order  con¬ 
vergence  is  assumed 
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2.  Newton’s  Method 


Among  nonlinear  optimization  researchers,  there  is  a  strong  suspicion  that  if  any  iterative  method  for 
any  problem  in  any  field  is  exceptionally  effective,  then  it  is  Newton’s  method  in  some  appropriate  context. 
This  is  not  to  say  that  Newton’s  method  is  always  practical. 

In  this  section  we  derive  the  Newton  iteration  for  nonlinear  equations  through  the  affine  or  linear 
Taylor  series  model  (1.3)  in  subsection  2.1,  and  the  Newton  iteration  for  unconstrained  optimization 
through  the  corresponding  quadratic  model  (1.7)  in  subsection  2.2.  We  give  a  rigorous,  but  intuitive, 
analysis  of  the  local  convergence  of  the  corresponding  iterative  method,  called  the  Newton  or  Newton- 
Raphson  method,  and  set  the  stage  for  a  discussion  of  some  clever  and  effective  modifications.  The  first  of 
these  modifications  is  controlled  inaccuracies  in  the  solutions  of  the  model  problems.  The  resulting  inexact 
Newton  method  is  the  topic  of  subsection  2.3,  and  it  can  greatly  reduce  the  expense  of  computing  Newton 
steps  far  from  the  solution.  This  is  especially  important  for  large  problems,  since  then  the  Newton  step 
itself  is  likely  to  be  computed  by  an  iterative  method.  Some  additional  modifications  are  discussed  Section 
3. 


2.1.  Newton’s  method  for  nonlinear  equations 

The  underlying  principle  in  most  of  continuous  optimization  is  to  build,  at  each  iteration,  a  local 
model  of  the  problem  which  is  valid  near  the  current  solution  estimate.  The  next,  improved,  solution  esti¬ 
mate  is  gotten  at  least  in  part  from  solving  this  local  model  problem. 

For  nonlinear  equations,  the  local  model  is  the  Taylor  series  model  (1.3).  The  basic  Newton  iteration 
can  be  written  as  the  root  of  this  model, 

X+  =  Xc-J(Xc)-'  F(xc)  ■ 

But  in  keeping  with  the  local  model  point  of  view,  we  prefer  to  think  of  it  as: 

Solve  J  )s?  =  -F(xc) 

.  V  (211) 

and  set  x+  =  xe+s?. 


where  s?  denotes  the  current  Newton  step,  because  this  results  directly  from 


0«F(*  )+■/(*)(*-*). 


(2.1.2) 


Notice  that  we  could  also  have  solved  directly  for  x+  from  (2.1.2),  giving 

J(xe)x+  =  -F(xc)+J(xe)xc  .  (2.1.3) 

In  exact  arithmetic,  these  formulations  are  equivalent  as  long  as  the  columns  of  J  (xc)  span  IR’“0'.  In 
computer  arithmetic,  (2.1.1)  generally  is  preferable.  The  reason  is  that  in  practice,  we  expect  the  magnitude 
of  the  error  in  solving  a  linear  system  Az  =  b  by  a  matrix  factorization  and  backsolve  technique  to  be 
estimated  by 

-■flJfl11  =itHAII  MA-|ll  * M-k(i4 ) , 

where  f  is  the  computed  solution,  the  components  of  A  and  b  are  machine  numbers,  and  p  is  the  quantity 
known  as  ‘machine  epsilon’  or  ‘unit  rounding  error’,  i.e.,  the  smallest  positive  number  such  that  the  com¬ 
puted  sum  of  1  +  p  is  different  from  1.  Since  we  generally  expect  s?  to  be  smaller  than  x+,  we  expect 
in  (2.1.1)  to  be  smaller  than  llx+-x+ll  in  (2.1.3).  Of  course,  we  commit  an  error  of  magni¬ 
tude  pllx*  II  when  we  add  scN  toxe  togetxe  in  (2.1.1),  but  this  does  not  involve  Kf/fXc)).  Thus,  we  will 
think  always  of  (2.1.1)  as  Newton’s  method  with  the  linear  system  solved  using  an  appropriate  matrix  fac¬ 
torization.  This  discussion  is  relevant  for  all  the  methods  based  on  Taylor  series  type  models.  For  a  dis¬ 
cussion  of  the  computer  solution  of  linear  equations,  see  Stewart  [1973]  or  Golub  and  Van  Loan  [1983]. 

The  main  theorem  of  this  section  is  given  below.  We  use  the  notation  N(x,S)  to  denote  the  open 
neighborhood  about  x  of  radius  S. 

Theorem  2.1.1.  Let  F  :  1R"  -» IR"  be  continuously  differentiable  in  an  open  convex  set  D  clR*.  If  there 
exists x»eD  and  r,p,  >  Osuch  that  J(x)e  Lip^N(x»  ,r)),  F (x»)= 0,  and  ll/fx.)'1  II  £P,  then  there  exists 
e  >  0  such  that  for  every  xo  e  N  (x. ,  e)  the  sequence  [x* }  generated  by 

x*+i  =x* -J(xk)-'F(xk),  k- 0, 1,  •  •  • 

is  well  defined,  converges  to  x* ,  and  satisfies 

llx*+1-x.  II  spyllxt-x.  II2. 

Proof:  Since  Lipschitz  continuity  implies  continuity,  and  since  the  determinant  is  a  continuous  function  of 
the  entries  of  a  matrix,  it  is  easy  to  see  that  we  can  assume  without  loss  of  generality  that  J  (x )  is  invertible 
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for 

xeN(x.,r)  and  that  1 1  y  (j:  )-1 1 1  S2J3. 

Thus,  if  x  e  N  (x. .  e)  for  e  £  min/> ,  (2(3y)'1J  ,  x*+i  exists  and 

xt+i-x.  =  x*-J(x*)-‘F(x*)-x.  +y(xt)_1F(x.) 

=  /(•**)"'  [F(x*)-F(xik)-y(x*)(x.  -x*)] . 

Using  (1.6),  this  gives 

llx^i-x.  II  £  ll/(x*)-'ll-  IIF(x.W(**Xx*  -xt)ll 
<2p-J  llx4-x.  II2 ^ PyHx*  -x.  II2 

£  PyeIIxj  -x.  II  £  -yllx*  -x.  II  , 

which  establishes  both  convergence  and  the  quadratic  rate  of  convergence,  and  completes  the  proof. 

Notice  that  we  could  apply  Newton’s  method  to  C(x)=J(x>)~'F(x),  for  which  G(x-)  =  0, 
Ja(X')mG  \x.)=l ,  and  7C (x ) e  Lip^yV  (x. , r ),  and  the  identical  sequence  of  iterates  and  convergence 
analysis  would  result  This  emphasizes  one  of  the  major  advantages  of  Newton's  method:  it  is  invariant 
under  any  affine  scaling  of  the  independent  variable.  Furthermore,  note  that  the  Lipschitz  constant  of 
•JgGO.Py.  is  a  relative  measure  of  the  nonlinearity  of  F  as  given  by  the  change  in  J  scaled  by  J(x-  )_1. 
This  is  satisfying  since  we  would  like  to  believe  that  F  is  no  more  or  less  nonlinear  if  we  change  our  basis. 

In  section  3.3,  we  give  an  example  of  the  application  of  Newton’s  method  to  a  specific  problem.  It  is 
compared  there  to  Broyden’s  method  which  is  superlinearly  rather  than  quadraticly  convergent,  but  which 
does  not  share  two  major  disadvantages  of  Newton’s  method.  Broyden’s  method  does  not  require  the 
expensive  and  error-prone  computation  of /(x)  or  the  finite  difference  approximation  discussed  in  Section 
32,  and  it  reduces  the  be  successful  to  solve  the  linear  system  (2.1.1)  for  the  Newton  step  from  0(n3)  to 
Ofn2)  operations.  Broyden’s  method  shares  the  third  major  disadvantage  of  Newton’s  method  in  that  both 
require  a  good  initial  pess  xo  to  guarantee  success.  Because  of  this,  they  are  referred  to  as  local  methods. 
and  must  be  augmented  by  the  techniques  discussed  in  Section  4  to  be  successful  from  poor  starting  points. 
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22.  Newton’s  Method  for  Unconstrained  Optimization 


In  this  section,  we  look  at  Newton’s  method  for  the  unconstrained  minimization  problem  (1.1).  It 
can  be  derived  by  applying  the  basic  iteration  (2.1.1)  to  the  first  order  necessary  condition  g(x)=0  for 
unconstrained  optimization  given  by  Theorem  1.1.  This  gives  the  basic  Newton  iteration 


W(ze)seA'  =  -g(ze) 

X+^Xc+S? . 


(2.2.1) 


The  Newton  iteration  (2.2.1)  is  equivalent  to  solving  for  a  zero  of  the  gradient  of  the  Taylor  series 


quadratic  model  m?(xc  +s)  given  by  (1.7).  If  H  fa )  is  positive  definite,  then  x+  is  a  local  minimizer  of  m? 


by  Theorem  1.2.  In  fact,  it  is  the  global  minimizer  of  the  model  m?  since  x.  is  the  only  zero  of  Vm  m. 
This  leads  to  a  very  satisfying  interpretation  of  Newton's  method  for  (1.2)  when  Hc  is  positive  definite. 
We  construct  a  uniformly  convex  quadratic  model  m?  of  /  by  matching  function,  gradient,  and  curvature 
to  /  at  Xc ,  and  then  step  to  the  global  minimum  of  the  model.  Furthermore,  since  H(xc)  is  positive 
definite,  then  the  Cholesky  factorization  can  be  used  to  solve  for  rcv  in  (2.2.1).  We  will  see  another  advan¬ 
tage  in  Section  4.1.  If  //(x^)  is  not  positive  definite,  however,  then  we  have  no  reason  to  believe  that  (2.1.1) 
may  not  lead  towards  a  saddle  point  or  even  a  maximizer. 

Newton’s  method  has  all  the  same  disadvantages  for  unconstrained  minimization  that  it  had  for  non¬ 
linear  equations.  That  is,  it  is  not  necessarily  globally  convergent,  and  it  requires  0 (n3)  operations  per  itera¬ 
tion.  In  addition,  we  need  H  (xc )  to  be  positive  definite  rather  than  just  nonsingular.  Furthermore,  both  first 
and  second  partial  derivatives  are  required.  We  will  address  these  issues  in  Sections  3  and  4. 

We  finish  the  subsection  by  stating  a  convergence  theorem  for  (2.2.1)  that  follows  directly  from 
Theorems  1.2  and  2.1.1. 

Theorem  2.2.1.  Let  / :  1R*  -» 1R  be  twice  continuously  differentiable  in  an  open  convex  set  D  cIR". 
Assume  there  exists  x»  e  D  and  r,  J3  >  0  such  that  g(x»)= 0,  H{x»)  is  positive  definite,  ll//(x. )-1  II  < p, 
and // (x ) €  Lip-jN(x- , r ).  Then,  there  exists  e>0  such  that  for  every  x<>e  N(x.  ,e),  the  sequence  (x* )  gen¬ 
erated  by 


Xk*i=xt-H(xt)-'g(xt) 


m*i  «.l 


is  well  defined,  converges  to  the  local  minimizer  x-  of  / ,  and  satisfies 


Hxt+i-x*  II  £  Py  1 1  x*  -x«  II2  . 


23.  The  Inexact  Newton  Method 


This  section  deals  with  a  topic  of  importance  in  the  use  of  Newton’s  method,  panicularly  for  large 
dimensional  problems.  In  such  cases,  iterative  methods  like  the  Gauss-Seidel  or  conjugate  direction 
methods  are  often  the  only  practical  way  to  solve  the  linear  system  (2.1.1)  or  (2.2.1)  for  the  Newton  step. 
The  natural  question  is  “how  does  the  inaccuracy  in  the  solution  of  (2.1.1)  for  the  Newton  step  affect  the 
rate  of  convergence  in  the  exact  Newton  method.” 

An  analysis  by  Dembo,  Eisenstat  and  Steihaug  [1982]  leads  to  the  following  result. 

Theorem  2_3.1.  Let  the  hypotheses  of  Theorem  2.1  hold,  and  let  [t)*  )  be  a  sequence  of  real  numbers  satis¬ 
fying  OSq*  £t|  <  1.  Then  for  some  e>  0,  and  any  x<>€  N(x. ,  e),  the  inexact  Newton  method  correspond¬ 
ing  to  any  [xk }  defined  by 


F  \xk)sk  =-F(Xk)+rk,  where  5  ’l* 


(2.3.1) 


X*+l  =  X*+5* 

is  well  defined  and  converges  at  least  q  -linearly  to  x. ,  with 


IIxa+i-x*  II  £qllx*-x.  II  . 

If  {*!*)  converges  to  0.  then  (x* }  converges  superlinearly  tox. .  If  {q* )  is  0(  { 1 1 F  (x* )  1 1* )  forO<p  <  1, 
then  (x* }  converges  tox.  with  q -order  1  +p. 

For  example,  this  result  guarantees  quadratic  convergence  if  at  each  iteration  the  approximate  New¬ 


ton  step  sk  satisfies 


IIF(xa)-F'(x*)s*II 


<  min  (1,  IIF(x*)ll)  . 


(2.3.2) 


The  quantity  on  the  left  of  (2.3.2),  sometimes  called  the  relative  residual,  is  usually  monitored  in  practice 
when  solving  a  linear  system  by  iteration.  Thus,  the  analysis  gives  a  convenient  stopping  rule  for  the  linear, 
or  inner,  iteration  for  solving  the  linear  problem  (2.1.1)  which  is  embedded  in  the  outer  iteration  for  solv¬ 
ing  the  nonlinear  problem  (1.2).  All  these  results  also  apply  to  (he  unconstrained  optimization  problem. 
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Notice  that  Theorem  2.3.1  is  for  an  iteration  (2.3.1)  in  which  every  quantity  can  be  computed  if  F ' 
and  F  can  be.  In  the  next  section,  we  will  consider  the  effects  of  using  approximations  to  the  Taylor  series 
model.  There  are  strong  connections  between  the  analysis  of  those  approaches  and  the  inexact  Newton 
method.  In  particular,  for  unconstrained  optimization,  (2.3.1)  is  equivalent  to  a  statement  about  the  inaccu¬ 
racy  in  an  approximation  to  the  gradient  V/  (xk). 


V 
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3.  Derivative  Approximations 


In  section  2,  we  developed  the  local  theory  for  Newton’s  method  based  on  the  appropriate  Taylor 
series  model.  In  practice,  users  prefer  routines  that  approximate  the  required  derivatives  either  by  finite 
differences  or  by  some  of  the  clever  and  effective  multi-dimensional  secant  methods.  In  this  section,  we 
will  consider  several  instances  of  these  approximations.  They  all  involve  using  local  models  of  the  form 


Me(xe+d)  =  F(xt)+Bed 


(3.1) 


to  solve  (1.2)  or  of  the  form 


ffle(Xe  +d)=f(Xc)  +  g(Xc)Td  +  -jdTBcd 
to  solve  (1.1).  We  call  these  quasi-Newton  models. 


(3.2) 


In  Section  3.1,  we  will  see  the  surprisingly  simple  load  convergence  analysis  for  any  quasi-Newton 
iterative  method 


Be  si-"  =  -F(xc)  or  Bc  s'-"  * -g(xc) 
x*~xc  +s<~M 

based  on  these  local  models.  Section  3.2  will  present  the  finite-difference  Newton  method. 


(3.3) 


Sections  3.3  and  3.4  will  be  devoted  to  some  popular  multi-dimensional  secant  methods  for  choosing 
Be.  We  will  concentrate  on  Broyden’s  method  far  nonlinear  equations  and  the  BFGS  method  for  uncon¬ 
strained  optimization.  Finally,  in  Section  3.5,  we  will  consider  the  incorporation  of  sparsity  into  the 
approaches  for  approximating  derivatives,  including  some  techniques  to  save  significantly  on  the  cost  of 
obtaining  Bc  by  finite  differences  when  /  (Xc )  has  a  known  sparsity  structure. 

Since  we  will  deal  with  sparsity  later  in  this  section,  it  seems  appropriate  to  introduce  this  important 
practical  property  here.  In  general,  we  say  that  a  matrix  is  sparse  if  few  of  its  entries,  say  less  than  15%. 
are  nonzero.  For  nonlinear  systems  of  equations,  J(x)  is  sparse  if  most  of  the  component  equations  F, 
involve  very  few  of  the  unknowns  x,.  For  unconstrained  minimization,  H{x)  is  sparse  if  most  variables  do 
not  interact  in  the  sense  that  most  of  the  cross  partial  derivatives  arc  zero. 

In  practice,  the  sparsity  of  J(x)  or  H(x)  is  independent  of  the  value  of  x,  and  this  allows  for  some 
important  savings  at  every  iteration  in  solving  for  the  Newton  step  in  (2.1.1).  Naturally,  we  wish  to  incor- 
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porate  such  a  useful  property  into  Bc  in  order  to  have  the  same  savings  in  (3.3).  Furthermore,  it  seems 
intuitive  that  we  should  be  able  to  more  efficiently  approximate  J(x)  or  //(x)  if  we  know  that  most  of  its 
entries  are  zero. 

To  some  extent,  this  is  true.  However,  sparsity  is  entirely  a  property  of  the  basis  used  to  represent  x , 
and  so  we  may  make  our  approximation  methods  more  dependent  than  before  on  the  specific  representation 
of  the  problem.  This  can  lead  to  practical  difficulties  of  the  sort  that  are  usually  called  ‘bad  scaling’. 


3,1.  General  convergence  theory 


In  this  section,  we  will  consider  general  local  models  of  the  forms  (3.1)  and  (3.2),  and  we  will  com¬ 
pare  them  to  the  appropriate  Taylor  series  or  Newton  model  to  obtain  one-step  local  convergence  estimates 
for  the  associated  iterative  methods  (3.3). 


Lemma  3.1.1.  Let  F  satisfy  the  hypotheses  of  Theorem  2.1.1  and  let  Mc(xc+d)  be  given  by  (3.1).  Then 
for  any  d , 


IIF(xe  +d)-Me(xc  +d)ll  S  -J-  lid  !IJ+  \\[BC-F\xc)}d\\ 


(3.1.1) 


In  addition,  if  Bc~l  exists  and  ee  =x»  -Xc ,  e+=x«  -x*.  for  x*.  defined  by  (33),  then 


lle+ll  S  IIBc-1  II 


S  llflt-Ml 


-Jll«ell 


II [#c  -J(*c)]ec  II 


II ec  II 


-JllCc  II  +  IISC  -/(xe)ll 


1 1  ec  1 1 


(3.1.2) 


llrc  II  . 


Proof:  First,  we  add  and  subtract  the  Newton  model  (1.3)  to  obtain 


IIF(xe+d)-Af<:(Xc+d)ll  =  \\F(.x€+d)±M?(.xt+d)-F(xt)-Bed\\ 

5  IIFfe  +d)-Af/r(xe  +d)ll  +  \\F(xt)+J(x<)d -Me(Xt  +d)ll  . 

This  says  that  the  error  in  the  model  (3.1)  is  bounded  by  the  sum  of  the  error  in  the  Newton  model  and  the 
difference  between  the  Newton  and  quasi-Newton  models.  Now  we  apply  (1.6)  to  the  first  term  and  sim¬ 
plify  the  second  term  to  obtain  (3.1.1). 

To  obtain  (3.1.2),  we  use  F(x>  )=0  and  (3.3)  to  write 
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X+-X-  *Xc  -X.  -*,"1  [F(xt)-F(x.)] 

«*,-'[*(* +et)-(F(xf)+*ce,)] 


so 


llejl  Z  II*,-1  II-  UF(xt+et)-Mc{xc  +e«)ll 
and  (3.1.2)  follows  from  (3.1.1). 

The  convergence  theorems  of  Section  2  for  Newton's  method  follow  very  simply  from  (3.1.2)  with 
Be  *7(xe).  Indeed,  we  have  reduced  the  analysis  of  any  quasi-Newton  method  (3.3)  to  an  analysis  of  the 
coi.esponding  Jacobian  or  Hessian  approximation  rule.  There  are  three  especially  important  properties  to 
consider. 

First,  sometimes  the  error  in  *,  as  an  approximation  to  J  (Xe )  is  controlled  by  a  parameter  0  in  the 
approximation  rule;  we  will  see  that  the  finite  difference  Newton  method  fits  this  mold.  In  such  rules,  it  is 
sometimes  the  case  that  if  the  sequence  of  iterates  (x* )  is  well  defined  and  converges  to  x. .  then  [Bk )  con¬ 
verges  to  7(x>).  Such  rules  are  said  to  provide  consistent  Jacobian  or  Hessian  approximations.  (See 
Ortega  and  Rheinboldt  [1970].)  The  reader  will  immediately  see  from  (3.1.2)  that  consistent  methods  are 
at  least  q-superlinear.  Of  course,  Newton's  method  is  consistent  and  q  -quadratic,  if  J  (x)  is  Lipschitz  con¬ 
tinuous. 

Second,  Dennis  and  More  [1974]  prove  that  a  quasi-Newton  method  that  is  convergent  with 
sufficient  speed  that  ^  llet  II  <••  (r -linear  is  sufficient),  will  be  q  -superlinear  if  and  only  if  the  approxi¬ 
mations  are  directionally  consistent  in  the  sense  that 

^jm[*t-/(x»)]-rTgTr-0.  (3.1.5) 

The  sufficiency  of  (3.1.5)  for  superlinearity  is  obvious  from  (3.1.2).  The  proof  of  necessity  is  a  bit  harder. 

These  two  consistency  properties  are  useful  especially  for  analyzing  rates  of  convergence  after  con¬ 
vergence  of  (x* )  has  been  established.  It  is  clear  that  convergence  can  be  established  with  much  less.  In 
particular,  we  see  from  (3.1.2)  that  if  we  could  ensure  that  for  every  k  such  that  xo,  ...,x*  is  defined,  that  ** 
would  be  defined  and 


II**-1  II  *3.  II**  -/(x*)ll  s  5,  and  35  <  1  . 


(3.1.61 


4 

c- 

4 


then  we  could  get  convergence  by  starting  so  close  that 


a*fj[Yll«oll/2  +  8]  <  1 . 

This  would  ensure  that  ll«i  II  £alleoll  <  Heoli  and  local  q -linear  convergence  would  follow  by  induc¬ 
tion. 

A  useful  property  of  the  Jacobian  or  Hessian  approximation  rule  in  obtaining  (3.1.6)  is  called 
bounded  deterioration.  There  are  various  useful  forms,  but  the  one  needed  to  analyze  the  Broyden,  PSB, 
Schubert,  and  DFP  secant  methods  is  from  Dennis  and  Walker  [1981],  pg.  981  where  the  reader  can  find  a 
proof  of  the  next  two  theorems. 

Theorem  2.12.  Let  F  satisfy  the  hypothesis  of  Theorem  2.1.1,  and  let  B.  e  1R"*"  have  the  property  that 
B._1  exists  and  for  some  operator  norm 

\r-B.~'FXx.)\  Sr.  <  1. 

Let  U:  1R"  x R“K" -» 2®“  be  defined  in  a  neighborhood  N  =NxxN2  of  (x-  ,B-)  where  N,  cfl  and  ,V2 
contains  only  nonsingular  matrices.  Assume  that  there  are  nonnegative  constants  ai  and  a2  such  that  for 
each  (x,B)€  N.and  for  x*=x  -B_1F(x),  every  B*€  U(x,B)  satisfies 

IIB+-B.  II  S  [l  +  a1o(x,x..)r]  •  IIB  -B.  II  +  a2o(x,x*)e 
for  ct(x ,x+)= max  { lx  -x.  I,  lx+-x-  I  ]. 

Under  these  hypotheses,  for  any  r  €  (r. ,  1),  there  exist  constants  cr,  5.  such  that  if  lx0-x.  I  <  er 
and  IBo-fl-  I  <  5r,  then  any  iteration  sequence  (x* )  defined  by 

x**,-x*-B*-'F(x*),  Bn*\  €  Ufa  fl*), 

A  * 0,1 . exists,  converges  q  -linearly  to  x>  with 

lx**i  -x.  I  S  r  •  lx»  -x.  I , 

and  has  the  property  that  ( IB*  I )  and  ( IB*-1 1 )  are  uniformly  bounded. 

There  are  some  important  ways  in  which  bounded  deterioration  may  be  weaker  than  consistency.  In 
particular,  the  essence  of  the  method  might  be  to  make  B*  not  be  too  much  like  J (x* ).  but  to  be  more  con¬ 
venient  to  use  in  (3.3).  For  example,  in  the  nonlinear  Jacobi  iteration.  B*  is  taken  to  be  the  diagonal  of 
J(to)  so  that  r/~*  defined  by  (3.3)  only  costs  n  divisions.  Of  course,  the  Jacobi  iteration  will  not  converge 


unless  the  partial  derivatives  on  the  diagonal  dominate  the  rest  of  A  standard  sufficient  condition  for 
this  which  implies  (3.1.6)  in  the  /.  norm  is  called  strict  diagonal  dominance.  See  Ortega-Rheinboldt 
[1970].  This  brings  up  another  point;  the  key  to  a  convergence  analysis  for  a  specific  method  is  often  in 
choosing  the  proper  norm. 

Finally,  some  methods,  like  the  BFGS  secant  method,  are  more  readily  analyzed  by  thinking  of  them 
as  directly  generating  approximations  to  the  inverse.  The  following  theorem  from  Dennis  and  Walker 
[1981],  pg.  982  gives  an  inverse  form  of  the  bounded  deterioration  principle. 

Theorem  3.1-3.  Let  F  satisfy  the  of  Theorem  2.1.1,  hypothesis  and  let  A",  be  an  invertible  matrix  with 
\I-K.F'(x.)\Zr.  <1. 

Let  U:  1R"  x  lR"**  ->2*“"  be  defined  in  a  neighborhood  N  xN2  of  ( x •  ,Af. ),  where  N i  c£l. 
Assume  that  there  are  nonnegative  constants  ai,a2  such  that  for  each  (x,K)  in  N ,  and  forx+=x -KF(x), 
the  function  U  satisfies 

II K+-K.  II  £[l  +  a1c(x,jr+)',]lltf-/:.  Il+a2<r(x,x+y 
for  each  AT+e  U(x,K).  Then  for  each  r  e  (r.  ,1)  there  exist  positive  constants  er,6r  such  that  for 
lxo-x.  I  <tr  and  \Ko-K- 1  < 5r , any  sequence  [x*]  defined  by 

x**i  =x*-/r*F(x*),  /r*+l€  U(xt,Kk) 

k  *0,1 . exists,  converges  ^ -linearly  to  x*  with  lx*+i -x.  I  Sr  lx*  -x«  I,  and  has  the  property  that 

{ UK*  I )  and  ( l/f*~'  I }  are  uniformly  bounded. 

3.2.  Finite-Difference  Derivatives 

In  Section  2  we  saw  that  the  local  models  furnished  by  the  appropriate  partial  Taylor  series  are  very 
useful  in  solving  continuous  optimization  problems.  Sometimes,  the  appropriate  partial  derivatives  are  not 
available  analytically,  and  other  times  the  user  is  not  willing  to  be  bothered  providing  them.  Thus,  most 
good  optimization  packages  provide  routines  for  estimating  partial  derivatives  by  finite  differences.  These 
routines  may  (and  should)  even  be  used  to  check  for  errors  in  any  derivatives  the  user  does  provide. 


uuu 
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In  this  section,  we  will  discuss  briefly  the  surprisingly  rich  topic  of  how  to  compute  finite-difference 
approximations  to  first  and  second  partial  derivatives.  We  will  discuss  aspects  of  mathematical  accuracy, 
numerical  roundoff,  convergence,  and  efficiency.  We  begin  with  accuracy. 

We  know  from  elementary  calculus  that 


Fi(x+hej)-Fi(x)  _  dFi(x) 

i*ia — h - 


where  e,  is  the  j*  column  of  the  n  xn  identity  matrix.  This  is  called  the,  forward  difference  approxima¬ 
tion  and  it  suggests  approximating  the  j *  column  of  J  (xe )  by 


AjF(xt.h)^-^-[F(xc+hjej)-F(xe))  (3.2.1) 

for  an  appropriately  chosen  vector  h  of  finite-difference  steps.  We  will  use  the  notation  AF  (x ,  h )  for  the 
Jacobian  approximation  whose  j *  column  is  given  by  (3.2.1). 


Lemma  3.2.1.  Let  INI  be  a  norm  for  which  lie;  II  =  1  and  let  F  e  LipyfD)  with  xc,Xc  +h,ej,j  =  1 . n 

all  contained  in  D .  Then 


UAJF(xc,h)-J(x<)eji\^^\h,\ 
Furthermore,  in  the  / 1  operator  norm. 


(3.2.2) 


'“"‘■B&t1*1* 


it  follows  that 


IIAF(xe,A)-/(xc)lllS-JllAIL 


(3.2.3) 


where  llh  1  is  the  /-  vector  norm. 


Proof:  The  proof  follows  easily  from  the  Taylor  series  remainder  (1.6),  since 


\\A,F(xc,h)-J(xe)el  II  =  I*;  I-1  MF(xe  +  A,e/)-F(*W(xc)V; 

=  I hj  I-1  •  II F(xc+hjej)-m?(xc  +hje,)\\ 

S  IA/|-'--Jllh/e;ll2=-JlA;l  . 

We  get  (3.2.3)  directly  from  (3.2.2)  since  1 1  e,  1 1 1  =  1  and  so 
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\\&F(Xc,h)-J  (Xc )  1 1 1  ■  {nitt  II  tSjF(xc,h)-J(xe)ei  II] 


Equation  (3.2.1)  also  is  used  to  approximate  the  gradient  by  the  forward  difference  approximation 

&/(*. h)  =  .1=1 . n.  (3.2.4) 

Although  this  forward  difference  approximation  is  generally  accurate  enough,  sometimes  central  differ¬ 
ences  are  useful  because  of  roundoff  considerations  we  will  discuss  later.  Since  central  differences  are 
most  often  used  for  gradient  approximations,  we  define  them  in  that  context. 


(3.2.5) 


and 


8/  (*e.fi)  =  (81/  (Xt,h) . 8  ./te./i))7-. 

Lemma  3.2.2.  Let  INI  be  a  norm  such  that  lle;  II  =  1.  and  let  He.  Lip-fJ) ),  withxe.Xc  +A,e,,  i  =  1 . n 

all  be  contained  in  D .  Then  8,/  (xe ,  h )  given  by  ( 32.5)  obeys 

I8./(*./0-!£-(xc)IS  (3.2.6) 

and 


H8/C*,/i)-g(xc)II.S  JllAlli. 

Proof:  Note  that  from  (1.7),  (1.8)  we  get 

If  (xe  +h,ei)~  m?(xc  +  h,e,)]-\f  (xc  -  h{  e, )  -  m?(xc  -  hi  e, )] 
=f(xe+hiei)-f(xc-hiei)-2hi-$L(xc). 

Thus,  from  (1.10)  and  the  triangle  inequality, 

l/(x*  +h,e,)-/  (xe  -h,ei)-2h,  (xe )  1 3  =  •£  I  A,  1 3 
from  which  (32.6)  follows  directly. 

Note  that  the  central  difference  gradient  is  more  accurate  than  the  forward  difference  gradient,  but 


that  it  requires  In  rather  than  n  evaluations  of  /(x)  if  we  assume  that  /  (xc)  is  already  available. 


If  the  gradient  is  obtained  analytically,  but  the  Hessian  is  to  be  approximated  by  finite  differences, 
then  it  can  be  approximated  by  applying  (3.2.1)  to  g(x),  obtaining  the  approximation  Ag(x^,h ).  However 
the  matrix  Ag(xc,h)  would  not  be  symmetric  although  H(xe)  would  be.  In  this  case,  a  sensible  strategy 

would  be  to  use  =  •j-[Ag(Xe)+AgCOr]  as  the  Hessian  approximation.  Some  theoretical  justification 

for  this  comes  from  noting  that  fie  is  the  Frobenius  norm  projection  of  Agfa)  into  the  subspace  of  all  sym¬ 
metric  matrices.  Thus,  from  the  Pythagorian  Theorem, 

NW(*e)-fiellF  *  Ntf«-Ag(*.A)llf 

where 

IIA  Ilf  ■  (£  la,/ l2)^  (3.2.7) 

>j 

is  the  Frobenius  norm.  This  norm  will  be  useful  later  when  we  again  want  an  inner  product  structure  on  the 
vector  space  of  real  matrices. 

It  is  also  possible  to  obtain  an  approximate  Hessian  using  (n2+3n)/2  evaluation  of/(x).  By  expand¬ 
ing  the  Taylor  series  through  third  order  terms,  a  stronger  version  of  (3.2.9-10)  can  be  proven  in  which  the 
I  hi  1 2/ 1  Ay  I  and  IAyl2/IA,  I  terms  are  removed,  and  the  constant  is  different 


Lemma  3.2.3.  Let  INI  be  a  norm  such  that  lie;  II  =  1  and  let//  zLip-^D)  withxe,  +  Aje;,Xe  +hje,, 
andXe  +hiti  +  hjej,i,j  =  1,  •  •  -n  all  contained  in D . 


Ljj  [Hc  ]  ■  f  +  hj  e,)— f  (xc  +  htCj)  f  (xc  •f/i,e/)+/  (xc ) 


(3.2.8) 


'  ^ 
l[W«]v-[ff(xe)l.>l  S-J  2  -pp-|-  +  3 1  hi  I  +  3 1  A;  I  +  2  • 

^  - 


(3.2.9) 


If  the /i,/_  or  Frobenius  norm  is  used,  then 


IIHe-H(x,)ll  max  2^^  +  3111, 1 +3IA;  I +2-!,^,-  .  i 

>  ^  4 

Proof:  The  proof  is  very  much  like  the  previous  proof.  Let  s,  -  hi  eitsj=  h,  e,  and  sk,  =  s,  +  s, ,  then 


(3.2.10) 


[f  Uc  +  Sij )  -  m?(xe  +  s,j  )]-(/■  (xe  +  Si ) -  m?(xc  +s,)]-[f  (xe +Sj)~  m?(Xc  +  s, )] 
*  /  (Xc  +  Si,  )  -f  (Xe  +  Si )  -f  (Xe  +  S,  )  +/  (x*  )  -  A,  A,  [I!  (xe  )],y  . 


From  the  triangle  inequality  and  (1.10),  we  get 

I hihj[He]-hihj[H(xt)]lj  I  *  £  [llJi,  ll3+  lls,  Il3  +  II  Jy  II3] 

<  [(  I  I  i,  II  +  Hi,  ll)3+  lls.  Il3+  IIS;  II3] 

s£[(IA,  l  +  IAyl)3+IA«l3+IA,  I3] 

from  which  (3.2.9)  and  (3.2.10)  follow. 

Now  we  have  seen  some  useful  rules  for  approximating  derivatives  by  differences  of  function  values, 
and  we  have  analyzed  the  accuracy  of  these  approximations.  So  far,  it  seems  as  though  the  obvious  thing 
to  do  is  to  wt  oose  the  vector  A  to  be  very  small  in  order  to  make  the  approximate  derivatives  more  accu¬ 
rate.  The  difficulty  is  that  these  approximation  rules  are  certain  to  lose  more  accuracy  due  to  finite- 
precision  cancellation  errors  as  A  becomes  smaller. 

In  section  2.1,  we  introduced  machine  epsilon  p  as  the  smallest  positive  quantity  for  which  the 
floating-point  sum  1  +p  would  be  different  from  the  floating  point  representation  of  1.  Thus,  even  if  we 
ignore  the  fact  that  all  functions  are  to  be  evaluated  in  finite-precision,  we  see  that  I  A,  I  must  be  large 
enough  that  xc  +h,ej  is  not  the  same  as  xc  in  finite  precision  arithmetic,  or  else  the  numerator  in  the  for¬ 
ward  difference  would  be  zero.  This  means  that  IAy  I  2pl[xc]y  I. 

If  we  remember  that  function  values  are  computed  in  finite  precision  with  t  digits  then  we  can 
believe  that  if  we  obtain  a  value  of  say  Ft(xt)  =  (.di  ...,d,)10*,  it  is  reasonable  that  d,  or  d,.,  are  much  less 
likely  to  be  correct  than  d\  or  d2.  This  can  lead  to  real  problems  when  coupled  with  the  fact  that 
Fi(Xc+hjej)=(.d\'  d,0-lO *'  will  probably  have  e'=e,  and  d*  '=dk  for  k  =  l,2,...,r<r,  with  T 

closer  to  t  the  smaller  I  Ay  I  is.  In  other  words,  the  smaller  we  take  I  Ay  I,  the  more  of  the  most  accurate 
leading  digits  of  F(x)  are  canceled  out  by  the  subtraction  F, (xc  +hJeJ)-Fi(xc)  in  the  numerator  of  the 
forward-difference  formula  (3.2.1).  This  means  that  the  difference  will  have  at  most  (r  -7)  meaningful 
digits,  which  are  computed  using  the  less  trustworthy  trailing  digits  of  F(x).  Thus,  IA;  I  must  at  least  be 
large  enough  so  that  F,  (xt )  and  F,  (xc  +hjej)  differ  in  some  trustworthy  digits. 

The  standard  rule  to  use  in  setting  the  entire  vector  A  for  (3.2.1)  or  (3.2.4)  is 
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where  (if  is  (he  relative  accuracy  in  the  subroutine  used  to  evaluate  F  or  / .  In  other  words,  if  the  routine 

is  accurate  t o  tF  <t  digits,  then  (if  =  10~*'  while  if  it  is  accurate  in  all  its  digits  we  take  (if  =  (L  If  some 

1 

component  [x],  =0,  then  we  take  A,  =(tf 7  for  lack  of  a  better  choice.  This  rule  attempts  to  balance  the 
mathematical  and  numerical  error  in  (3.2.1);  see  e.g.  Dennis  and  Schnabel  [1983]. 

The  reader  will  see  that  all  these  numerical  difficulties  clearly  are  compounded  in  the  Hessian 
approximation  (3.2.8),  which  calls  for  a  larger  relative  magnitude  of  A  since  the  inaccuracies  in  the 
numerator  are  divided  by  A2.  We  also  see  that  the  bound  in  (3.2.9)  points  toward  choosing  all  the  com¬ 
ponents  of  A  to  be  about  the  same  magnitude.  Thus,  we  recommend  that  Xe  be  scaled  as  well  as  possible, 
and  an  analysis  suggests  that 

Ae=( (3-2.12) 

This  rule  is  also  suggested  for  use  in  (3.2.S). 

Next  we  state  a  theorem  on  the  theoretical  rate  of  convergence  of  finite-difference  methods.  The 
reader  can  easily  furnish  a  proof  by  combining  Lemma  3.2.1  with  Lemma  3.1.1. 


Theorem  3.2.4.  Let  F  and  x-  obey  the  hypotheses  of  Theorem  2.1.1  in  the  /i  norm.  There  exists  E,-q  >  0 
such  that  if  [A*]  is  a  sequence  in  JR"  with  OS  II  A*  ll-Sq,  and  x0e  N (x- , e),  then  the  sequence  {x* }  gen¬ 
erated  by 


**J*\ 


F(xk+(hk)ie,)-F(xt) 

- mr - 


dF 


(x*). 


(A*),  *0 
(A*)/  =0, 


x*+1=x*-5r1/r(x*),  k  =0, 1,  •  •  ■ 


is  well  defined  and  converges  q -linearly  to  x* .  If  ^m  II  A*  ll_=0,  then  the  convergence  is  q  -superlinear.  If 


there  exists  some  constant  C\  such  that  II  A*  II.SCi  I  lx*  -x.  Hi,  or  equivalently,  a  constant  Cj  such  that 
1 1  At  ll_£C2ll/r(xi)lli,  then  the  convergence  is  q  -quadratic. 


Even  though  Theorem  3.2.4  does  not  consider  the  finite  precision  effects  discussed  above,  it  reflects 
the  practical  experience  with  finite  difference  approximations :  if  the  stepsizes  are  properly  selected,  then 
finite  difference  methods  give  similar  performance  to  the  same  methods  using  analytic  derivatives.  The 
main  disadvantage  of  using  these  approximations  is  their  cost.  The  forward  difference  approximations 


.  A  m  Ii  AJI 


(3.2.1)  to  /(x)  or  (3.2.4)  to  g(x)  require  n  additional  evaluations  of  F{x)  or / (x),  respectively,  while  the 
central  difference  approximation  (3.2.5)  to  g(x)  requires  2 n  additional  evaluations  of  / (x).  The  approxi¬ 
mations  to  H{x)  require  either  n  additional  evaluations  of  g (x)  for  (3.2.1)  or,  for  (3.2.8),  (n2+3n)/2 
evaluations  of  /  (x).  These  costs  can  be  considerable  for  problems  where  function  evaluation  is  expensive. 
In  the  remainder  of  this  section  we  discuss  cheaper  approximations  to  7(x)  and  H  (x)  that  may  be  used 
instead. 

It  turns  out  that  if  g(x)  is  not  available  analytically,  it  is  almost  always  approximated  by  finite  differ¬ 
ences,  because  an  accurate  gradient  is  crucial  to  the  progress  and  termination  of  quasi-Newton  methods, 
and  the  secant  approximations  that  we  discuss  next  are  not  accurate  enough  for  this  purpose.  In  Section  5.1 
we  discuss  a  class  of  methods  that  does  not  require  gradients. 

3 3.  Broyden’s  Method 

In  one  dimension,  the  secant  method  is  an  effective  local  method  for  solving  nonlinear  equations.  It 
can  be  viewed  as  a  forward-difference  method  in  which  the  step  size  h *  used  in  constructing  the  new 
iterate  from  x„  is  taken  to  be  xc  -x»,  so  that  the  local  model  derivative  B+  is 
[f  (x*+xe  -x»))-f  (x*)]/[xc  -x*].  Thus,  no  extra  F  values  are  needed  to  determine  B+  and  build  the  new 
local  model  since  f  (x»-t-k*)  =  F(x, ). 

From  the  local  model  point  of  view,  the  secant  method  follows  from  assuming  that  we  will  determine 
the  approximate  derivative  in  the  new  model  of  F(x++d),  M.(x+  +  d)  =  F(x+)+B+d,  by  requiring 
M  *(Xe )  to  match  F  (x* ).  This  means  that 

F(Xc)  =  A/*(x,+  (Xe  -x«))  =  /r(x*)  +  5*(xc  -X*) 
is  used  to  determine  B  *.  This  results  in  the  system  of  linear  equations 

Bse=yc  (3.3.1) 

where  re  *x*-xc  and yc  =F(x+)~ F (xc). 

For  n  =  1,  (3.3.1)  uniquely  determines  5*.  For  n  >  1,  there  is  an  n  x  (n  -  1)  dimensional  linear  mani¬ 
fold  in  1R***  of  soluuons  B  to  (3.3.1).  The  most  commonly  used  secant  method  for  systems  of  nonlinear 
equations,  Broydcn's  method,  makes  a  specific  selection  of  B*  from  this  manifold  which  costs  only  a  small 
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multiple  of  n 2  to  compute  and  is  a  rank-one  correction  to  Bc .  It  also  has  a  very  elegant  geometric  interpre¬ 
tation  given  by  the  following  lemma  from  Dennis  and  More  [1977]. 


Lemma  3 .3.1.  Let  Bc  s  ,  sc  ,ye  e  R" ,  s  * 0.  Then  Broyden’s  update 


is  the  unique  solution  to 


n  o  ,  (yc-BcSc)sT 

B+-B'  + - 37c - 


min  Ilf?  -Bc  Ilf  subject  to  Bsc  =ye . 


scsl 


(3.3.2) 


Proof:  If  Bsc ~ye,  then  B+-Be  =  [5  -Bc]  -y-^-  and  so 

Se  Sc 


Sc  Si 


II B+-Bc  Ilf  S  II B  -Bc  ||f  -  ||^-ll2=  II B-Bc  Ilf 

Sc  Sc 


Thus,  Broyden’s  method  generalizes  the  1 -dimensional  secant  method  by  changing  the  current 
derivative  approximation  as  little  as  possible  in  the  Frobenius  norm  consistent  with  satisfying  the  secant  or 
quasi-Newton  equation  (3.3.1).  For  this  reason  it  is  called  a  least-change  secant  update.  Of  course,  this 
leaves  the  problem  of  finding  B 0  to  start  the  process.  Generally  B0  is  obtained  by  finite  differences. 

Broyden,  Dennis  and  More  [1973]  proved  that  Broyden’s  method  is  locally  q  -superlinearly  conver¬ 
gent.  The  proof  is  in  two  parts,  and  it  is  typical  of  all  the  least-change  secant  method  proofs.  First,  one 
establishes  local  q  -linear  convergence  by  proving  bounded  deterioration  and  applying  Theorem  3.1.2. 
Then,  one  gets  q -superlinearity  by  establishing  (3.1.5). 

Lemma  3 3J2.  Let  D  c  1R"  be  a  convex  domain  containing  xe ,  x+,  and  x. .  Let  /  e  Lip^D ),  Bc  e  1R’,X" , 
and  letf?+  be  given  by  Broyden’s  update  (3.3.2).  Then, 


llf?*-7(x»)llf  £  llf?e  -7(x»)llf  +yo(xc,x+) 
Proof:  Remember  that  o(xe  ,x+)  ■  max  ( llxe  -x.  1 1 2,  llx+-x.  1 1 2)  - 

B+-J(x.)  =  Bc-J(x.)+(ye±J{x,)fe  )S' 

Se  Sc 


(3.3.3) 


=  [Sc -/(*•)] 
=  [Be  -J  (x« )] 


/- 


ScSl 

si Sc 


l- 


Sc  Si 
Sc  Sc 


,  yc  -J{x‘)sc)sl 

'  U7c 

+  j[7(xc+rsc)-y(x.)]d, 


Sc  Sj 
si  Sc 


V  %  s  *.  . 


It  is  then  straightforward  to  obtain 


IIS+-y(x.)M/'  £  WBe-J(x.)\\p- 11/ 


Sc  sc 

+yl  Wxc +ise-x.H2d,  \l2dL\l2 


sjsc 


'1' 


Z  IIBC  -,/(x.)llF+YC(x+,xe). 


s  s 

since  I  — y-^-  and  -y^-  are  l2  projection  matrices  and  so  they  have  unit  norm. 

Sc  Sc  Sc  Sc 


This  completes  the  proof,  but  we  can’t  resist  some  comments  about  the  geometry  of  the  proof  and  its 
relation  to  the  more  general  derivation  of  least -change  secant  methods.  Notice  that 


BW(x.)  =  [«♦-/(*)]• 

r  _  Sc  Sc 

+  [BW(x.)] 

Sc  si 
*777“ 

Sc Sc 

Sc  Sc 

In  IR"**  with  the  Frobenius  norm  inner  product,  this  is  just  the  orthogonal  decomposition  of  B+-/(x. )  into 
its  projection  into  the  subspace  of  matrices  that  annihilate  sc  and  the  orthogonal  complement  of  the  annihi- 
lators  of  se.  But  the  essence  of  Broyden’s  method  is  that  B*  has  the  same  projection  as  Bc  into  the  annihi- 
lators  of  se .  Thus, 


B+-J(x.)=[Bc-J(x.)] 


(3.3.4) 


Now  the  norm  of  the  first  term  on  the  right  hand  side  of  (3.3.4)  will  be  smaller  than  I IBC  -/(*• )  I  I ,  while 
the  magnitude  of  the  second  term  is  totally  dependent  on  the  sagacity  of  our  choice  of  yc  -B^se  as  an 
approximation  to  J  (x.  )sc . 


In  general,  the  idea  of  a  least-change  secant  method  is  to  adopt  an  inner  product  structure  on  matrix 
space  and  a  secant  condition  (3.3.1).  In  addition,  one  may  require  Be  and  B  +  to  be  in  some  closed  linear 
manifold  A  of  matrices  defined  by  another  desirable  property  like  symmetry  or  sparsity.  One  then  obtains 
B*  by  orthogonally  projecting  Bc  into  the  generalized  intersection  of  A  with  the  matrices  that  satisfy 
(3.3.1),  using  the  chosen  inner  product.  By  generalized  intersection,  we  mean  the  projection  of  the  set  of 
secant  matrices  into  A . 

The  next  subsection  contains  another  application  of  this  approach.  Dennis  and  Schnabel  [1979] 
derive  many  more  interesting  updates  based  on  this  approach.  Dennis  and  Walker  [1981]  give  conditions 
on  the  secant  equations  and  the  inner  products  used  at  each  iteration  so  that  the  resulting  quasi-Newton 
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method  is  locally  ^ -linearly  as  fast  as  the  method  that  uses  £*  *£• .  Schnabel  [1983]  considers  multiple 
secant  equations.  Dennis  and  Walker  [1985]  consider  the  case  when  an  appropriate  secant  condition  is 
imperfectly  known.  Grzegorski  [1985],  Flachs  [1986]  relax  the  conditions  on  A  to  convexity. 

Now  we  return  to  the  consideration  of  Broyden’s  method.  In  order  to  complete  the  proof  of  super- 
linear  convergence,  we  need  to  show  that  (3. 1.5)  holds.  In  fact,  it  is  not  hard  to  show  that 


-/(*)]*  II 

— rnrn - 0 


is  also  equivalent  to  superlinear  convergence  under  the  same  hypotheses.  Notice  that 


(3.3.5) 


ll[B* -/(*.)]*  Ilj  )]W„ 

- ITaTT^ - " - 77Tk - (316) 

and  so  (3.3.5)  seems  a  reasonable  condition  to  try  for  in  light  of  our  previous  discussion.  For  complete¬ 
ness,  we  state  the  theorem.  For  an  elementary  complete  proof,  see  page  177  of  Dennis  and  Schnabel 
[1983]. 


Theorem  3J_3.  Let  F  satisfy  the  hypotheses  of  Theorem  2.1.1.  There  exist  positive  constants  £.8  such 
that  if  llxo-x.  Iliceand  llB0-f  (x0)l !  < 8,  then  the  sequence  [x*]  generated  by  Broyden’s  method 


x*-i  =  x* -B*-‘F(x»),  k  *0,1,  •  •  ■ 

„  „  .  (yk-Bksk)s[ 

Bi+'-Bk+ — 7ITk 

yt  «F(x*+i)-F(x*),  sk  =xt+1-x* 
is  well  defined  and  converges  q-superlinearly  to  x- . 


Proof:  We  will  sketch  the  proof.  First,  notice  that  Lemma  3.3.2  shows  that  the  hypotheses  of  the  bounded 
deterioration  result.  Theorem  3.1.2,  holds  with  fl.  =  7  (x* ).  Thus  the  method  is  at  least  q  -linearly  conver¬ 
gent.  Let  £*  =Bk  -7(x.)- 

In  order  to  show  (3.3.5),  we  need  a  technical  lemma.  From  the  Pythagorian  Theorem  in  IR"*" ,  we 

have 


m  tt 


* ll£*M'“ 2iir*'ri, 


(3.3.7) 
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where  the  inequality  follows  from  :a£  l(3l  kO  implies  (a2  -  (J2)7  <  a  -  fJ2/2a. 
By  (3.3.4),  (3.3.6),  and  (3.3.7), 


1 1  £**i  1 1  f  S  M£* 


I  - 


j*j£ 

slsk 


ll/r+ll£*+1A£llf 

Si  St 


s,,£*,,/,_itt^tt7 


II£*j*  ll2 


I  Is*  1 1 2 


ll£*4.]S*  1 1 2 


+~nrr 


Thus,  from  some  manipulation  and  the  proof  of  Lemma  3.2.2, 

S  2 ll£*  Ilf  [ll£*  Ilf -»!£*+,  Ilf +Yer(x*+,.x*)]  . 

Now,  since  Lemma  3 32  shows  that  ( IIB*  Ilf )  is  uniformly  bounded  and  I  lx*  — x.  ll2  converges  to  zero  at 
least  q -linearly,  ( ll£*  Ilf }  is  uniformly  bounded  by  some  b  and  ^a(x;*i,x,)<«>.  This  gives  that 


% 


ll£*s*  ll2 


TTJ*Tn 


52 b 


<  , 


IIEollf  -  ll£*+i  Ilf  +Y^O(xk+i.xk) 


and  (3.3.5)  follows. 

A  natural  question  is  whether  [Bk }  always  converges  to  /(x. ).  The  answer  is  that  it  does  not  and  that 
the  final  approximation  may  be  arbitrarily  far  from  J(x>),  c.f.  pg.  185  of  Dennis  and  Schnabel  [1983].  On 
the  same  page  is  the  following  example  which  provides  a  comparison  with  Newton's  method. 

Let 
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V 


Broyden's  Method 


Newton’s  Method 


1.5 

2.0 

*0 

1.5 

2.0 

0.8060692 

1.457948 

Xl 

0.8060692 

1.457948 

0.7410741 

1.277067 

Xz 

0.8901193 

1.145571 

0.8022786 

1.159900 

X  3 

0.9915891 

1.021054 

0.9294701 

1.070406 

X4 

0.9997085 

1.000535 

1.004003 

1.009609 

*5 

0.999999828 

1.000000357 

1.003084 

0.9992213 

x< 

0.99999999999992 

1.0000000000002 

1.000543 

0.9996855 

X  7 

1.0 

1.0 

0.99999818 

1.00000000389 

Xg 

0.9999999885 

0.999999999544 

X9 

0.99999999999474 

0.99999999999998 

X\Q 

1.0 

1.0 

Xll 

The  final  approximation  to  the  Jacobian  generated  by  Broyden's  method  is 


Aio1 


1.999137  2.021829 
0.9995643  3.011004 


,  whereas  J(x»)  =  . 


We  proved  superlinear  convergence  by  proving  that  ^ 
ll£*r*  II 


II 

niirrr 


<  <*>.  It  is  easy  to  see  that  if 


<  «>,  then  (Bt )  converges,  because 


D  d  _  (y*-£*J*)j*  _  Eksksf  ,  ^  _  N 

Bk^.Bk  = - ^ +  0  • 

Furthermore,  if  we  just  knew  how  fast  goes  to  zero,  we  could  say  more  about  the  rate  of  conver¬ 

gence  of  (x*)  to  x* .  The  only  related  result  we  know  about  is  due  to  Gay  [1979],  He  proves  that 
xju-m  if  J (x)  is  constant  for  all  x,  Le„  F  is  affine.  This  allows  him  to  prove  the  2* -step  <7 -quadratic 

convergence  of  (x* }  to  x* ,  and  that  in  turn  implies  r  -order  at  least  2^* . 


Broyden's  method  is  very  popular  in  practice,  for  two  main  reasons.  First,  it  generally  requires 
fewer  function  evaluations  than  a  finite  difference  Newton’s  method.  Second,  it  can  be  implemented  in 
ways  that  require  only  OOi2)  arithmetic  operations  per  iteration.  We  conclude  this  section  by  mentioning 
three  basic  ways  to  do  this. 


a 


If  n  is  small  enough  to  allow  storage  of  a  full  n  x  n  Jacobian  approximation  and  use  of  a  QR  factori¬ 
zation  to  solve  (3.3),  then  we  recommend  a  scheme  due  to  Gill,  Golub,  Munay,  and  Saunders  [1974],  In 


•.  I 
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this  scheme,  one  has 


BC=QCRC  but  wants  B.  =  QJl+ 

and.  since  B.  =  Be  +  — — *c  —  mBe  +  uvT, 

Sc  $c 

B*  =  Qc  [Rt  +  {Qju)  vT]  •  &  [*e  +  wvT] . 

Now  a  low  multiple  of  n2  operations  is  sufficient  to  get  a  QR  factorization  of  a  rank -one  update  to  an  upper 
triangular  matrix  and  so 

B  +  =  QJI+  where  Re  +  wvT  =  QR+  and 


The  second  and  third  ways  of  implementing  the  update  am  both  based  on  the  Sherman -Morrison- 
Woodbury  formula  (c.f.  Dennis  and  Schnabel  [1983]  pg.  188).  It  is  easy  to  see  that  if  B  is  nonsingular  and 
u,v  €  IR* ,  then  B  +uvr  is  nonsingular  if  and  only  if  1  +  vTB~'u  «G)*0.  Then, 


(B  +  uv r)-1  =  B -  -i-  (B u )  vT  B 


(3.3.8) 


(3.3.9) 


Broyden  [1963]  suggests  updating  the  sequence  of  approximate  inverse  Jacobians  using  (3.3.8),  i.e., 

<s  9  x * - rftS-Tfc - 

This  is  only  feasible  for  about  the  same  class  of  dense  problems  as  the  QR  updating  scheme.  The  QR 
scheme  is  more  trouble  to  implement,  but  it  has  the  advantage  that  the  condition  number  of  the  current 
Jacobian  approximation  can  always  be  monitored  by  using  standard  techniques  to  estimate  the  condition 
number  of  an  upper  triangular  matrix.  Approaches  based  on  (3.3.8)  may  become  of  increased  importance 
on  parallel  computers. 


The  final  way  of  implementing  Broydcn’s  method  is  very  useful  for  large  problems,  and  it  is  based 
on  (3_3.9>.  The  idea  can  be  found  in  Matthies  and  Strang  [1979].  At  the  k*  step,  we  assume  that  we  have 
some  way  of  solving  Box=b  for  any  b\  for  example,  we  might  have  a  sparse  factorization  of  B0 ■  This 
allows  us  to  recursively  solve 


Bk*\Sk*\  —  -F(Xk+\) 


•.t 


by  using  (3.3.9)  and  solving 


BkWi  *Tn7n7(y‘'B***)’ 
computing  Mi«i*  ||5‘-||  *vt, 

solving  BkJk*i--F 

,  -  1  sfst+i 

and  computing  St*  1  =  St*i  -  —  w»  - 

This  scheme  must  be  restarted  whenever  the  number  of  vectors  allocated  to  store  the  update  vectors  is 
filled. 

3.4.  BFGS  Method  Tor  Unconstrained  Minimization 

Now  we  discuss  the  selection  of  a  secant  approximation  Be  to  the  Hessian  matrix  in  the  model 
(3.3.1).  We  could  simply  apply  Broyden’s  method  to  find  a  root  of  g(x)  =  0.  This  is  not  done  because 
more  effective  alternatives  exist  based  on  using  more  of  the  structure  of  the  Hessian  matrix  which  we  are 
trying  to  approximate.  In  this  section  we  will  try  to  acquaint  the  reader  with  the  high  points  of  this  rich 
material  Far  mare  information,  see  Dennis  and  More  [1977],  Dennis  and  Schnabel  [1979],  Fletcher 
[1980],  or  Dennis  and  Schnabel  [1983]. 

The  analog  of  (3.3.1)  far  solving  g  (x)*0  is  the  same  except  for  the  definition  of  yc : 

Bsc -ye  *g(x*)-f(xc).  (3.4.1) 

where  B  is  meant  to  approximate  H(x+).  Equation  (3.4.1)  causes  the  quadratic  model 

m(x»+d)  =  /(x*)+g(x*)rd  +  4 -dTBui 

* 

to  interpolate  /  (x«),  g(x+),  and  ;(Xc).  However  H(x+)  is  symmetric,  while  the  Broyden  update  fl-  of  Bc 
generally  will  not  be.  Also 

II  -//(jc*) II,-  S  WB+-H (x,)  Ilf 

indicates  that  we  can  approximate  H  (x+)  more  closely  with  a  symmetric  B .. 

Thus,  it  is  natural  to  use  the  least  change  ideas  of  the  last  sccuon  to  select  £.  as  the  projection  of  B. 
onto  the  intersection  of  the  matrices  obeying  (3.4.1)  with  the  subspacc  A  of  symmetric  matrices  in  IR"“* 
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Then  we  obtain  the  PSB  or  Powell  symmetric  Broyden  update 

B-B '*  HT, - HUT? - ■  (3'4'2’ 

Note  that  B ♦  will  inherit  symmetry  from  Bc .  This  update  can  be  quite  effective,  and  we  will  meet  it  again 
in  the  next  section.  It  is  not  generally  used  in  practice  for  two  reasons.  First,  it  can  have  difficulty  with 
poorly  scaled  problems.  Second,  we  will  see  in  Section  4  that  it  is  desirable  that  each  2?*  be  positive 
definite,  but  B  +  will  only  inherit  positive  definiteness  from  Bc  under  some  conditions  more  restrictive  than 
those  required  for  (3.4.1)  to  have  a  symmetric  positive  definite  solution  B  . 

An  obviously  necessary  condition  for  (3.4.1)  to  have  a  positive  definition  solution  B  is  that 

slye  -  sjBse  >  0 .  (3.4.3) 

We  will  now  prove  by  construction  that  (3.4.3)  is  also  sufficient  for  the  existence  of  a  symmetric  and  posi¬ 
tive  definite  solution  B  to  (3.4.1),  by  constructing  the  BFGS  method  due  to  Broyden  [1969],  Fletcher 
[1970],  Goldfarb  [1970]  and  Shanno  [1970]. 

If  we  assume  that  Bc  is  symmetric  and  positive  definite,  then  it  can  be  written  in  terms  of  its  Chole- 
sky  factors 

Bc=LtLj,  Lc  lower  triangular  . 

In  fact,  Le  will  probably  be  available  as  a  by-product  of  solving  (3.3).  We  want/?*  to  be  a  symmetric  posi¬ 
tive  definite  secant  matrix  which  is  equivalent  to  the  existence  of  a  nonsingular  /*  for  which 

B^-JJl  and  JJlsc-ye. 

Let  ve  *  Jlsc ,  so  that  7*ve  =yc  and  v/vc  =y/jc  if  7*  exists.  If  we  knew  vc ,  then  it  would  seem  reasonable 

to  take 


f  _ f  ,  (ye-Lcvc)v[ 

J  *  *  Lc  +  ■■  t - 

vTv, 


(3.4.4a) 


in  view  of  the  success  of  Broydcn’s  method.  Transposing  (3.4.4a),  multiplying  both  sides  by  sc ,  and  using 
Jlse  *  ve  and  vcrvc  =  yjsc  gives 


vTl  Tr 

ve  =JU'  =Lj$c  +vc  (1--^T~-) 

/C  5C 
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which  simplifies  to 


and  which  exists  if  yjsc  >  0. 


(3.4.4b) 


Equations  (3.4.4a,  b)  define  /+  such  that  JJl  =5+  is  the  BFGS  update.  It  is  customary  to  write  the 


BFGS  update  as 


B+  =  Be  + 


yc  yl  Bc  Sc  sjBc 


(3.4.5) 


- e^yT^  Sc  Be  Sc 

but  it  is  not  efficient  to  actually  calculate  and  factor  such  a  matrix  at  each  iteration.  Instead  Goldfarb 
[1976]  recommends  updating  the  Cholesky  factorization  LCLJ  of  Bc  by  applying  the  QR  update  scheme  of 
subsection  3.3.3  to 


Jl  —  Lj  +  Uc  Vc 

to  gttJZ  =  QJ-Z  in  0(n2)  operations.  In  fact,  we  don’t  need  to  form  Q+  since  we  only  care  about 


B+  =  JJZ=L.QZQd.Z=LJ.Z  . 


This  Cholesky  update  form  is  the  recommended  way  to  implement  the  BFGS  method  if  a  full  Chole¬ 
sky  factor  can  be  stored.  For  large  problems,  updates  can  be  saved  and  the  Sherman-Monison-Woodbury 
formula  applied  in  the  manner  of  Subsection  3.3.3.  It  is  also  possible  to  obtain  the  direct  update  to  the 
inverse. 


/d-i\  /d— i\  ,  (4c - (B -1  )c ye ) si + se (se ~(B~l)cyc )T  yf(.Sc-(B-l)cyc)TscsI 

(S  >■* - w, - - 


(3.4.6) 


The  BFGS  method  can  also  be  derived  as  the  least-change  symmetric  secant  update  to  (B~')c  in  any 
inner  product  norm  of  the  form  III (•) 111  ■  IIAf«.(-)AfI  II  where  MJ4Zsc  =yc-  In  other  words,  the  BFGS 
method  is  the  result  of  projecting  ( Bc )~'  into  the  symmetric  matrices  that  satisfy  B~xyc  =  sc ,  and  the  projec¬ 
tion  is  independent  of  a  large  class  of  inner  product  norms. 

If  instead  we  project  Bc  into  the  symmetric  matrices  for  which  Be  sc  =yc.  in  any  norm  of  the  form 
IIAf*1  ()M zT  Hf  ■  III (*) III  where  M+MZsc  =yc<  the  result  is  the  update  formula 
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»  »  .  (yc-Besc)yI+yc(yc-Besc)T  sj(yc  -Bcsc)ycy7  ,,  A 

B*=Bc  + - Wc - (ytt - •  (3A7) 

Equation  (3.2.7)  is  called  the  DFP  update  for  Davidon  [1959],  Fletcher  and  Powell  [1963],  and  it  also 
passes  positive  definiteness  from  Bc  to  fr  ♦  whenever  yjsc  >  0. 

Theorem  3.4. 1  is  a  local  superlinear  convergence  result  for  all  three  methods  we  have  discussed  in 
this  subsection.  This  is  a  combination  of  three  theorems  from  Broyden,  Dennis  and  Mord  [1973].  The 
proofs  follow  the  same  outline  as  the  proof  for  Broyden ’s  method. 

Theorem  3.4.1.  Let  /  satisfy  the  hypothesis  of  Theorem  2.2.1.  Then,  there  exist  e  >  0,  5  >  0  such  that  the 
sequences  generated  respectively  by  the  BFGS,  DFP,  or  PSB  methods  exist  and  converge  q  -superlinearly 
tox»  from  any  xo.fr  o  for  which  llx.  -xoll  Se  and  Ilfro-W  (x«)H  <8.  Furthermore,  the  PSB  method  con¬ 
verges  if  H  (x. )  is  nonsingular  but  not  positive  definite. 

The  BFGS  method  seems  to  work  especially  well  in  practice.  Generally  it  requires  more  iterations  to 
solve  a  given  problem  than  a  finite-difference  Newton  method  would,  but  fewer  function  and  gradient 
evaluations.  Most  experts  feel  that  a  property  which  contributes  to  the  success  of  the  DFP  and  BFGS 
methods  is  that  the  iteration  sequences  are  invariant  with  respect  to  linear  basis  changes  under  reasonable 
hypotheses.  This  property  is  not  shared  by  the  PSB  method.  The  superiority  of  the  BFGS  to  the  DFP  is 
still  an  interesting  research  topic  and  is  discussed  briefly  in  Section  6.3. 

A  final  issue  is  how  to  choose  the  initial  Hessian  approximation  fro  in  a  BFGS  method.  In  analogy  to 
Section  3.3  we  could  choose  fro  to  be  a  finite  difference  approximation  to  V2/ (xo),  but  besides  being 
expensive,  an  initial  Hessian  often  is  indefinite  and  then  may  be  perturbed  anyhow  as  we  will  see  in  Sec¬ 
tion  4.  Instead,  the  common  practice  is  to  set  fro = /  so  that  fro  is  positive  definite  and  the  first  step  is  in  the 
steepest  descent  direction.  This  choice  may  not  correctly  reflect  the  scale  of  the  Hessian,  and  so  a  one-time 
scaling  correction  often  is  applied  during  the  first  iteration;  see  Shanno  and  Phua  [1978a]  or  Dennis  and 
Schnabel  [1983]. 
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3 £  Sparse  Finite-Difference  and  Secant  Methods 

In  this  section,  we  will  consider  briefly  the  incorporation  of  sparsity  into  the  methods  of  the  previous 
three  subsections.  We  will  see  that  there  are  important  opportunities  for  efficiency  in  finite-difference 
Jacobian  and  Hessian  approximations.  The  development  of  effective  sparse  secant  methods  will  be  seen  to 
be  more  problematical. 

Medium  size  dense  problems  are  usually  solvable  by  library  subroutines  based  on  local  models  we 
have  already  considered.  When  a  problem  is  large  enough  to  make  sparsity  an  important  consideration,  it 
often  is  useful  to  incorporate  the  source  of  the  problem  into  the  method.  In  Section  6.1  we  will  discuss 
some  promising  approaches  to  this  end.  Here  we  discuss  general  purpose  approaches. 

We  begin  with  a  discussion  of  special  finite-difference  methods  for  sparse  problems. 

Curtis,  Powell  and  Reid  [1974]  began  one  of  the  most  elegant  and  useful  lines  of  research  on  sparse 
nonlinear  problems.  They  noticed  that  if  the  sparsity  structure  of  /(x)  is  such  that  some  subset  Cj  of  the 
column  indices  has  the  property  that  there  is  at  most  one  nonzero  in  any  row  of  the  submatrix  composed  of 
these  columns,  then  all  the  nonzero  element  in  this  submatrix  of  AF(xc,h)  could  be  calculated  from  the 
single  function  difference 

F(xt  +  y  hjej)-F(xc). 

In  order  to  illustrate  the  enormous  savings  possible  in  finite-difference  methods,  notice  that  if  J  (x)  is  tridi¬ 
agonal,  then  only  three  extra  values  of  F  (x )  are  needed  for  any  n  to  build  AF(xc,h ).  They  are 

F(.Xc  +hit\  +  h4e*hiei+  ■  •  •  )■  F(xc  +d\) 

F(xe  +hif2+hsei  htct+  ••')*F(xe+d  2) 

F(xt  +  hiei+h6eth9e9+  "  •  •  )*F(x<  +d}) . 

The  submatrix  consisting  of  the  first,  forth,  seventh,  tenth, ...  columns  of  &F  (Xc ,  h )  is  then  determined  from 
the  function  difference  F(xe  +di)-F(Xc).  The  other  two  submatrices  are  determined  in  the  same  way 
from  F (xc  +  d£-F(xe)  and  F(xc  +d3)-F(xc). 

Curtis,  Powell,  and  Reid  suggested  some  effective  heuristics  to  keep  the  number  of  submatrices,  and 
hence  the  number  of  function  evaluations,  small.  Coleman  and  More  [1983]  and  Coleman,  Garbow,  and 
More'*  [1984]  exploited  the  connection  between  a  subclass  of  graph  coloring  problems  and  the  problem  of 


determining  how  to  partition  the  columns  of  J  (x)  to  save  further  on  function  evaluations.  Although  they 
proved  that  the  problem  of  minimizing  the  number  of  submatrices  is  NP-complete,  they  developed  some 
very  useful  heuristics. 

Li  [1986]  noticed  that  if  no  component  of  sc  =x+-xc  is  zero,  then  the  function  value  at  the  past  itera¬ 
tion  F  (x-)  can  be  used  to  reduce  the  number  of  extra  function  evaluations  in  either  of  these  approaches  by 
one.  In  the  tridiagonal  case,  this  reduces  to  using  h=sc  and  F(xc)-F(xc  -d j), 
F(xc  -d\)-F(xe -d\-dj),  and  F {Xt-d\-di)-F(x-)  respectively  to  calculate  the  three  submatrices. 
He  suggests  leaving  the  j*  column  unchanged  if  (sc  );  =  0.  His  analytical  and  computational  results  sup¬ 
port  this  approach. 


This  approach  can  also  be  applied  to  approximate  a  sparse  Hessian  from  gradient  values.  Powell  and 
Toint  [1979]  developed  methods  to  further  reduce  the  number  of  gradient  evaluations  required  in  this  case. 
To  illustrate  their  main  idea,  assume  that  H  (x )  is  diagonal  except  for  a  full  last  row  and  column.  We  can 
then  approximate  the  last  column  by  [g(xc  +h„eK)-g  (xc )]  ■  A,-1  and  the  last  row  by  its  transpose.  The  first 


n  - 1  components  of  gfe  +  V  hjej)-g(xc)  suffice  to  obtain  an  approximation  to  the  rest  of  H(xc).  Thus 


two  extra  gradient  evaluations  suffice  while  n  would  be  required  for  the  same  sparsity  pattern  without  sym 


metry.  Powell  and  Toint  also  suggest  a  more  complex  indirect  approximation  method.  Coleman  and  Mord 
[1984]  and  Coleman,  Garbo w,  and  More'*  [1985]  showed  that  problem  of  minimizing  the  number  of  extra 
gradient  evaluations  also  is  related  to  graph  coloring,  and  again  developed  useful  heuristics  although  this 
problem  is  also  NP-complete.  Goldfarb  and  Toint  [1984]  show  the  connection  between  the  finite- 
difference  approximation  and  tiling  the  plane  when  the  nonlinear  problem  arises  from  numerically  solving 
a  partial  differential  equation. 

Now  we  consider  sparse  secant  methods.  Schubert  [1970]  and  Broyden  [1971]  independently  sug¬ 
gested  a  sparse  form  of  Broyden ’s  method.  Reid  [1973]  showed  that  their  update  is  the  Frobenius  norm 
least  change  secant  update  to  Bc  with  A  taken  to  be  the  set  of  matrices  with  the  sparsity  of  J  (x ). 


In  order  to  state  the  algorithm,  let  us  define  P,  to  be  the  1 2  projection  of  IR*  onto  the  subspace  2,  of 
1R*  consisting  of  all  vectors  with  zeros  in  every  row  position  for  which  the  corresponding  column  position 
of  the  i*  row  of  J(x)  is  always  zero.  That  is,  P ,  v  zeroes  the  elements  of  v  corresponding  to  the  zero  ele- 


menis  of  row  i  oiJ(x)  and  leaves  the  others  unchanged.  We  also  need  the  pseudoinverse  notation  a*  to 
denote  0  when  the  real  scalar  is  <j  =0  and  a~]  otherwise.  Then  the  sparse  Broyden  or  Schubert  method  is: 
Given  Bc  sparse,  and  xc 
solve  Bcsc  =-F(xc). 

Set  x+=Xc +rf  and 

B.  =  BC  +  J  [P, (sc )TPj (sc )]* ej (yc  -Bcsc)e,Pl(sc)T  .  (3.5.1) 

Note  that  (3.5.1)  reduces  to  Broyden’s  method  if  J (x)  has  no  nonzeroes. 

Marwil  [1979]  gave  the  first  complete  proof  that  (3.5.1)  is  locally  q -superlinearly  convergent  under 
the  hypothesis  of  Theorem  3.3.3.  In  practice,  this  method  is  cheaper  than  the  Broyden  update,  but  the  sav¬ 
ings  are  small  because  there  is  no  reduction  over  the  cost  of  Newton’s  method  in  solving  for  the  quasi- 
Newton  step.  An  important  practical  use  of  this  method  is  to  update  a  submatrix  of  an  approximate  Jaco¬ 
bian,  the  other  columns  of  which  might  be  generated  by  finite  differences  as  suggested  above.  Dennis  and 
Li  [1986]  test  and  analyze  a  strategy  based  on  using  heuristics  of  Coleman  and  Mord  to  pick  out  subsets  of 
columns  that  can  be  very  efficiently  approximated  by  finite  differences.  The  remaining  columns  are  then 
updated  by  (3-5.1).  The  results  are  very  good,  and  there  are  indications  that  similar  approaches  are  useful 
in  engineering  applications. 

For  unconstrained  optimization,  Marwil  [1978]  and  Toint  [1977]  constructed  a  sparse  analog  of  the 
PSB  update  (3.4.2).  Toint  analyzed  the  method  under  some  safeguarding,  and  Dennis  and  Walker  [1981] 
give  a  complete  proof  of  local  q  -superlinear  convergence.  An  example  by  Sorensen  [1981],  however, 
raises  doubts  about  the  utility  of  the  method.  We  will  not  dwell  on  this  topic  because  the  update  seems  to 
share  the  shortcomings  of  the  FSB,  and  in  addition,  it  requires  the  solution  of  an  extra  nxn  positive 
definite  linear  system  with  the  same  sparsity  as  H(x)  for  the  update  of  Bc  to  S+.  Thus  this  method  has  not 
had  a  major  practical  impact. 

Professor  Angelo  Lucia  of  Clarkson  reports  good  results  using  the  cheap  and  simple  expedient  of 
projecting  B+ defined  by  (3.5.1)  into  the  subspace  of  symmetric  matrices,  i.e.,  he  uses  the  approximate  Hes¬ 
sian  defined  by  -2-[fl++2?I].  Steihaug  [1980]  had  shown  this  method  to  be  locally  q -superlinearly  convcr- 


Unfortunately,  extending  the  BPGS  and  DFP  algorithms  to  sparse  problems  seems  at  a  dead  end. 
One  problem  is  that  a  sparse  positive  definite  approximation  may  not  exist  in  some  cases  where  yjsc  >  0. 
A  more  pervasive  problem  is  that  sparsity  is  not  invariant  under  general  linear  basis  changes,  while  the 
essence  of  these  secant  methods  is  their  invariance  to  any  linear  basis  changes.  We  will  comment  in  Sec¬ 
tion  6.1  on  some  work  that  uses  more  fundamental  problem  structure  to  get  around  the  problems  of  the 


straightforward  approach  to  least  change  secant  methods  for  sparse  unconstrained  minimization. 


4.  Globally  Convergent  Methods 


The  methods  for  unconstrained  optimization  discussed  in  Sections  2  and  3  are  locally  convergent 
methods,  meaning  that  they  will  converge  to  a  minimizer  if  they  are  started  sufficiently  close  to  one.  In  this 
section  we  discuss  the  modifications  that  are  made  to  these  methods  so  that  they  will  converge  to  a  local 
minimizer  from  a  poor  starting  point  x<>.  Methods  with  this  property  are  called  globally  convergent. 

Two  main  approaches  have  emerged  for  making  the  methods  of  Sections  2  and  3  more  globally  con¬ 
vergent  while  retaining  their  excellent  local  convergence  properties.  They  are  line  search  methods  and 
trust  region  methods.  Both  are  used  in  successful  software  packages,  and  neither  has  been  shown  clearly 
superior  to  the  other.  Furthermore,  both  approaches  certainly  will  play  important  roles  in  future  research 
and  development  of  optimization  methods.  Therefore  we  cover  both  approaches,  in  Sections  4.2  and  4.3 
respectively.  We  briefly  compare  these  approaches  in  Section  4.4. 

The  basic  idea  of  both  line  search  and  trust  region  methods,  is  that  they  use  the  quickly  convergent 
local  methods  of  Sections  2  and  3  when  they  are  close  to  a  minimizer,  and  that  when  these  methods  are  not 
sufficient,  they  use  some  reliable  approach  that  gets  them  closer  to  the  region  where  local  methods  will 
work.  The  basic  concept  behind  this  global  phase  is  that  of  a  descent  direction,  a  direction  in  which  /  (x ) 
initially  decreases  from  the  current  iterate  Xe .  Descent  directions  and  their  relation  to  local  methods  are 
discussed  in  Section  4.1.  Included  in  Section  4.1  is  a  discussion  of  the  well-known,  but  slow,  method  of 
steepest  descent. 

Global  strategies  for  solving  systems  of  nonlinear  equations  are  obtained  from  global  strategies  for 
unconstrained  optimization,  for  example  by  applying  the  methods  of  this  section  to  /  (x )  =  II F  (x )  II2.  For  a 
discussion  of  this  topic,  see  e.g.  Dennis  and  Schnabel  [1983],  Section  6.5. 

4.1  Descent  Directions 

The  basic  strategy  of  most  globally  convergent  methods  for  unconstrained  optimization  is  to 
decrease  /  (x)  at  each  iteration.  Fundamental  to  this  is  the  idea  of  a  descent  direction  from  xe ,  a  direction 
d  from  Xe  in  which  /  (x )  initially  decreases. 
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The  proof  of  Theorem  1.1  (the  necessary  condition  for  unconstrained  optimization)  showed  that  d  is 
a  descent  direction  from  x*  if  and  only  if 

Vf{xc)Td  <0. 

i.e.,  the  directional  derivative  in  the  direction  d  is  negative.  In  this  case,  for  all  sufficiently  small 
E>0 <f(Xe+td)<f(xe)-  Thus  given  a  descent  direction  d,  one  can  always  choose  a  new  iterate 
x*  =  Xt+ed  so  that/  (x+)  <  /  (xc ).  This  property  is  the  basis  of  globally  convergent  methods. 

A  natural  question  to  ask  is  whether  the  local  methods  for  unconstrained  optimization  discussed  in 
Sections  2  and  3  yield  steps  in  descent  directions.  These  methods  were  derived  by  considering  the  local 
quadratic  model  of  /(x)  around  xc ,  which  in  general  had  the  form 


(4.1.1) 


me  (xc  Hi )  =  /  (x, )  +  V/  (xe  )Td  +  He  d. 

They  then  chose  d  =  (x* )  causing  x+=xc+d  to  be  the  critical  point  of  (4.1.1). 

If  Hc  is  positive  definite,  x+  is  the  unique  minimizer  of  the  model;  furthermore 


V/ (xc )Jd  =  -V/ (Xe Y  Hc~'  V/(xc)<0 

so  that  d  is  a  descent  direction.  On  the  other  hand,  if  Hc  is  not  positive  definite,  then  not  only  doesn’t  the 
model  have  a  minimizer,  but  also  d  may  not  be  a  descent  direction. 

In  implementations  of  the  leading  secant  methods  for  unconstrained  minimization  such  as  the  BFGS 
method,  Hc  always  is  positive  definite.  Thus  the  steps  generated  by  these  methods  always  are  in  descent 
directions. 

When  Hc  =  V2/  (xe),  however,  it  may  not  be  positive  definite  when  Xc  is  far  from  a  local  minimizer. 
Thus  the  Newton  step  d  =  -V2/  (xe  )_1  V/  (xe)  is  not  necessarily  in  a  descent  direction.  We  will  see  that 
line  search  and  trust  region  methods  deal  with  indefinite  Hessian  matrices  in  different  ways. 

The  idea  of  choosing  x+  to  be  a  step  from  xc  in  a  descent  direction  d  also  leads  naturally  to  the  idea 
of  taking  steps  in  the  "steepest"  descent  direction.  By  this  one  means  the  direction  d  for  which  the  initial 
rate  of  decrease  from  xc  in  the  direction  d  is  greatest  For  this  definition  to  make  sense,  the  direction  d 


must  be  normalized;  then  we  can  define  the  "steepest"  descent  direction  as  the  solution  to 


min  Vf(zcYd  subject  to  \\d  11=1. 


(4.1.2) 


v.v.v.v 


“".".V. n  ’  ■  "A"  /  ”7^,  vXvva 
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The  solution  to  (4.12)  depends  on  the  choice  of  norm;  for  the  Euclidean  norm  it  is  d  = 
-V/  (ie)  /  1 1 T/-  (xt )  1 1 2.  which  is  generally  known  as  the  steepest  descent  direction. 

One  classic  minimization  algorithm  is  based  solely  on  the  steepest  descent  direction.  It  is  to  choose 
each  new  iterate  x*+i  to  be  the  minimizer  affix)  in  this  direction,  i.e. 

choose  tk  to  solve  mininjize  /  (x*  -t  V/  (x*)) 

Xk*\  =  Xk  -  tk  Vf  (x* ).  (4.1 .3) 

This  is  known  as  the  method  of  steepest  descent. 

The  method  of  steepest  descent  is  an  example  of  a  globally  convergent  method.  By  this  we  mean 
that  if  a  sequence  of  iterates  (x* )  is  generated  by  (4.1.3)  for  a  continuously  differentiable  /  (x)  that  is 
bounded  below,  then  Um  V/(x*)  =  0.  However  the  method  has  several  important  practical  drawbacks. 

Most  importantly,  it  usually  is  slow.  If  the  method  converges  to  x» ,  then  it  can  be  shown  that 


where  M2=  V2/(x. ).  and  c  =  (X,  -  X.)/(X,  +  X.)  for  (Xi.X,)  the  largest  and  smallest  eigenvalues  of 
V2/  (x.).  Furthermore,  for  any  /  (x),  (4.1.4)  can  be  shown  to  be  an  equality  for  some  starting  x0.  Thus  the 
method  is  only  linearly  convergent  and  may  be  very  slowly  convergent  for  problems  with  even  slightly 
poorly  conditioned  V2 fix-).  Secondly,  as  written  (4.1.3)  requires  the  solution  of  an  exact  one-variable 
minimization  problem  at  each  iteration.  The  steepest  descent  method  may  be  implemented  with  an  inexact 
minimization  and  still  retain  (4.1.4)  but  the  work  per  iteration  may  still  be  large.  Thirdly,  the  method  is 
very  sensitive  to  transformations  of  the  variable  space.  If  the  variable  space  is  changed  to  x  =  T-  x,  the 
Hessian  matrix  in  this  new  variable  space  becomes  T~T  V2/  (x)  T~l,  so  that  by  the  above  discussion  the 
rate  of  convergence  may  be  significantly  altered.  Indeed,  the  effectiveness  of  even  a  single  steepest  des¬ 
cent  iteration  in  reducing  /(x)  may  depend  significantly  upon  the  units  used  in  defining  the  variables.  In 
contrast,  the  performance  of  the  Newton  or  BFGS  methods  is  unaffected  by  linear  transformations  of  the 
variable  space. 

For  these  reasons,  the  method  of  steepest  descent  is  not  recommended  as  a  general  purpose  optimiza¬ 
tion  method.  We  will  see,  however,  that  steepest  descent  steps  play  a  role  in  the  trust  region  methods  dis¬ 
cussed  in  Section  4.2  and  in  the  conjugate  gradient  methods  of  Section  5.2.  Furthermore,  versions  of  the 
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method  of  steepest  descent  continue  to  be  used  successfully  in  some  practical  applications,  for  example 
problems  where  xq  is  close  to  x*  and  the  cost  of  computing  V2/  (x)  is  prohibitive,  even  though  conjugate 
gradient  or  BFGS  methods  may  be  more  efficient  for  suc.i  problems. 


4.2  Line  Search  Methods 


The  general  idea  of  a  line  search  method  is  to  choose  a  descent  direction  from  Xc  at  each  iterauon, 
and  select  the  next  iterate  x+  to  be  a  point  in  this  direction  that  decreases  /  (x ).  That  is 

choose  dc  for  which  V/  (xe)T  dc  <  0 
choose  x*  =  xe  +tede.ic  >  0,  so  that/ (x*)  <  /  (xc). 

Note  that  the  method  of  steepest  descent  fits  this  framework. 

Modem  line  search  methods  differ  from  the  method  of  steepest  descent  in  three  important  ways:  1) 
de  usually  is  chosen  to  be  the  Newton  or  secant  direction;  2)  lc  is  chosen  by  a  procedure  that  requires 
much  less  work  than  an  exact  minimization;  and  3)  te  =  1  is  used  whenever  possible,  so  that  the  line  search 
method  reduces  to  Newton's  method  or  a  secant  method  close  to  a  local  minimizer.  In  this  section  we  sum¬ 
marize  the  convergence  properties  of  such  methods,  the  selection  of  the  search  direction  dc ,  and  practical 
procedures  for  calculating  te . 

Starting  with  the  work  of  Armijo  [1966]  and  Goldstein  [1967],  it  has  been  shown  that  line  search 
methods  will  be  globally  convergent  if  each  step  satisfies  two  simple  conditions.  The  first  is  that  the 
decrease  in  /  (x )  is  sufficient  in  relation  to  the  length  of  the  step  sc  =te  dc ;  the  relation 

/(jC*)</(xe)  +  ay/,(xe)TJc  (4.2.1) 

usually  is  chosen  to  implement  this  condition  where  ae  (0,1)  is  some  constant.  Note  that  for  any 

sufficiently  small  step  in  a  descent  direction  de,  (4.2.1)  is  satisfied  for  any  a  <  1.  The  second  condition  is 

that  the  step  is  not  too  short.  The  equation 

V/(xt)ri(iPV/(xc)rjc  (4.2.2) 

is  most  commonly  chosen  to  implement  this  condition,  where  [J  e  (0,1);  it  says  that  the  step  must  be  long 

enough  so  that  the  directional  derivative  increases  by  some  fixed  fraction  of  its  original  magnitude. 


/•  u*»  ---  •«  -■*  *.  ^  .-V 


The  main  value  of  equations  (4.2.1)  and  (4.2.2)  is  that  incorporating  them  into  a  line  search  algo¬ 
rithm  leads  to  a  practical  and  globally  convergent  method.  Theorem  4.2.1  says  that  given  any  descent 
direction  dc  and  0<a<P<l,itis  always  possible  to  choose  tc  >  0  so  that  x*  =  Xc  +tcdc  satisfies  (4.2. 1 ) 
and  (4.2.2)  simultaneously.  Theorem  4.2.2  says  that  if  every  iterate  is  chosen  in  this  way  and  the  directions 
dc  are  selected  reasonably,  the  method  will  be  globally  convergent  For  proofs  of  these  theorems,  see 
Wolfe  [1969,1971]  or  Dennis  and  Schnabel  [1983]. 

Theorem  4.2.1  Let /  :R"-*R  be  continuously  differentiable  and  bounded  below.  Let  xt  e  R* .  dt  e  Rn 
satisfy  V/ (x*)rd*< 0.  Then  if  0  <  a  <  P  <  1,  there  exist  0  <  ti  <  ti  such  that  for  any  r*  e  [ti,  rj. 
x*+i  =  x*  +  tkdk  =  x*  +  Sk  satisfies 

/(t»*,)</(xl)  +  aV/(x1)rii  (4.2.3) 

and 

V/ (x*+i )Tsk  £  p  V/ (xk)TSk  ■  (4.2.4) 

Theorem  4.2 2  Let  /  :  be  continuously  differentiable  and  bounded  below,  and  assume  V/(x)  is 

Lipschitz  continuous  in  R*.  Given  Xoe  R" ,  suppose  the  sequence  (x* )  is  defined  by  x*+i  =  x*  +dk, 
k  =0,1 . and  that  there  exists  cr>0  such  that  for  each  k  >0, 

V/(x*Fs*  <  -o  IIV/(x*)  II  2  list  llz  (4.2.5) 

and  (4.2.?'  (4.2.4)  are  satisfied.  Then  either  V/(xt)  =  0  for  some  k ,  or  hjm  Vf  (x* )  =  0. 

The  only  restriction  in  Theorem  4.2.2  that  we  have  not  yet  discussed  is  equation  (4.2.5).  This  simply 
says  that  each  step  direction  must  be  a  descent  direction  where  in  addition,  the  angle  between  se  and  ihe 
negative  gradient  is  less  than  some  fixed  angle  less  than  90° .  For  example,  if  j*  =  -Hkx  V/  (x*)  where  //* 
is  V2/  (n)  or  any  approximation  to  it,  then  (4.2.5)  is  satisfied  if  the  condition  numbers  of  Ht  are  uniformly 
bounded  above.  This  is  not  a  big  restriction  in  pracuce  although  not  all  methods  can  be  shown  to  enforce  ii 
in  theory. 

Other  conditions  can  be  substituted  for  (4.5)  and  still  allow  Theorem  4.2.1  and  4.2.2  to  hold.  A  com¬ 
mon  substitution  for  (4.2.2)  is 


I V/ (x»)rrc  I  £  P  I Vf(xc)Tsc  I 


(4.2.6) 


again  with  |J  e  (a,l)  ;  as  J3  -» 0,  (4.2.6)  causes  x«.  to  approach  a  minimizer  of  /(x)  along  the  line  in  the 
direction  dc  from  xc  .  Another  substitution  for  (4.2.2)  which  causes  our  theorems  to  remain  true  is 

/(x+)  *  f(xc)  +  yVf(xc)Tsc 

for  some  ye  (0,a) . 

Note  that  (4.2.2)  (or  (4.2.6))  and  V/  (xe  )r <4  <  0  imply 

(Vf  (x+)  -  Vf(Xc))T  sc  Z  ((3-1)  V/  (zc  )T  sc  >  0 

which  is  the  condition  (3.4,3)  that  we  saw  is  necessary  and  sufficient  for  the  existence  of  symmetric  and 
positive  definite  secant  updates.  Thus  by  enforcing  (4.2.2)  or  (4.2.6)  in  a  BFGS  method,  we  also  ensure 
that  it  is  possible  to  make  a  positive  definite  update  at  each  iteration. 

Two  practical  issues  remain  in  applying  Theorems  4.2.1  and  4 22  to  obtain  a  practical  line  search 
algorithm  :  the  choices  of  the  step  direction  dc  ,  and  the  efficient  calculation  of  the  step  length  ic  to  satisfy 
(4.2.1)  and  (4.2.2).  In  addition,  we  need  to  show  how  we  retain  the  fast  local  convergence  of  the  methods 
discussed  in  Sections  2  and  3. 

Our  methods  arc  based  upon  the  quadratic  model  (4.1.1)  where  He  =  V2/ (Xc)  or  an  approximation  to 
it.  When  Hc  is  a  BFGS  approximation,  then  it  is  positive  definite  and  line  search  methods  use 
dc  =-//e_1  V/  (xe)  .  We  saw  in  Section  4.1  that  this  dc  is  always  a  descent  direction.  Generally  BFGS 
based  line  search  methods  do  not  explicitly  enforce  (4.2.5);  it  can  be  shown  to  be  true  in  theory  under  cer¬ 
tain  assumptions  (see  e.g.  Broyden,  Dennis  and  More  [1973],  Powell  [1976]). 

When  He  =  V2/  (xe)  or  a  finite  difference  approximation  to  it,  then  Hc  may  or  may  not  be  positive 
definite.  The  standard  practice,  due  to  Gill  and  Murray  [1974]  is  to  attempt  the  Cholesy  factorization  of  Hc 
in  such  a  way  that  the  result  is  the  factorization  LeLj (or  LcDcLj)  of  (Hc  +  Ec).  Here  Lc  is  lower  triangu¬ 
lar,  Dc  is  positive  diagonal,  and  Ee  is  a  non-negative  diagonal  matrix  which  is  zero  if  He  is  positive 
definite  and  not  too  badly  conditioned.  Then  dc  is  obtained  by  solving  LcLjdc  =-  V/(xc)  ,  so  that 
dc  =  ~(He  +  Ee )-'  V/  (xe )  .  Thus  dc  is  the  Newton  direction  if  Hc  is  safely  positive  definite,  as  it  will  be 
near  a  strong  local  minimizer  and  usually  at  most  other  iterations  as  well.  Otherwise  dc  is  some  descent 
direction  related  to  the  Newton  direction.  In  all  cases,  the  cost  of  the  factorization  is  only  Ofn1)  operations 
more  than  a  normal  Cholesky  factorization,  dc  obeys  (4.2.5),  and  the  size  of  Ec  is  bounded  in  terms  of  the 
size  of  He  .  Schnabel  and  Van  Vleck  [1987]  recently  have  developed  a  new  version  of  this  modified 


Cholesy  decomposition  which  appears  to  reduce  the  size  of  Ec  while  improving  the  condiuomng  of 
Hc  +  Ec  . 


Thus  any  line  search  method,  whether  it  chooses  Hc  to  be  the  Hessian  or  a  BFGS  approximation  in 
it,  will  use  dc  -  -H~x  V/ (xc)  as  the  search  direction  in  the  neighborhood  of  a  minimizer  x.  where  V2/  (x.  > 
is  positive  definite.  If  the  steplength  tc  =  1  is  permissible,  this  means  that  the  global  convergence  results  of 
this  section  arc  consistent  with  the  fast  local  convergence  of  Sections  22  and  3.4.  Theorem  4.2.3,  due  to 
Dennis  and  More  [1974],  shows  that  this  is  the  case  as  long  as  a  <  -y  in  (4.2.1). 

Theorem  4.2.3  Let  /  :  R"  -» R  have  a  Lipschitz  continuous  Hessian  in  an  open  convex  set  D  .  Consider  a 
sequence  x*  generated  by  x*+)  =  x*  +  tkdk,  where  V/  (x*  )r dk  <  0  for  all  k  and  tk  is  chosen  to  satisfy 
(4.2.1)  with  a  <  -y  ,  and  (4.2.2).  If  x*  converges  to  a  point  x.  €  D  at  which  V2/  ( xk )  is  posiuve  definite, 
and  if 

i*J2L - H3*n - =  0  •  (4 1 

then  there  is  an  index  jfc0  2  0  such  that  for  all  A  2  ko ,  f*  =  I  is  admissible.  Furthermore,  V/(x- )  =  0  .  and  if 
Ik  =  1  for  all  k  2  k0  ,  then  {  x*  }  converges  q  -superlinearly  to  x. . 

If  exact  Hessians  arc  used,  d*  will  be  -^/(x*)-1  V/(x*)  for  all  xk  sufficiently  close  to  x» ,  so  that 
(4.2.7)  is  trivially  true  and  quadratic  convergence  is  achieved  by  using  u  =  1  .  In  a  BFGS  method,  the 
analysis  of  Broyden,  Dennis,  and  More  [1973]  or  Powell  [1976]  shows  that  (4.2.7)  is  true  so  that  q- 
superlinear  convergence  can  be  retained. 

From  Theorem  4.2.3,  we  see  that  to  combine  fast  local  convergence  with  global  convergence,  a  prac¬ 
tical  procedure  for  selecting  the  steplength  tc  should  always  try  te  *  1  first  (at  least  near  a  minimizer)  and 
use  it  if  it  is  admissible.  Beyond  this,  experience  has  shown  that  a  practical  procedure  for  choosing  tc  to 
satisfy  (4.2.1)  and  (4.2.2)  should  aim  to  be  as  efficient  as  possible,  in  that  it  chooses  a  and  j3  so  that  there  is 
a  wide  range  of  points  satisfying  (4.2.1)  and  (4.2.2),  and  uses  the  first  point  that  it  finds  in  this  range  rather 
than  trying  to  closely  approximate  the  minimizer  of  /  (x )  along  the  line  xc  +  /  dc .  Many  strategies  for 
accomplishing  these  goals  efficiently  have  been  proposed,  and  probably  every  line  search  that  is  coded  is 
unique  in  some  way.  Algorithm  4.2.1  indicates  the  structure  of  a  representauve  line  search. 
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There  are  four  possible  stages  in  this  line  search.  If  tc  =  1  satisfies  both  (4.2.1)  and  (4.22!).  then  x.  = 
Xe  •+•  df ,  and  no  further  line  search  calculations  are  performed.  If  te  =  1  is  not  admissible  because  it  fails 
(4.2.1),  then  ic  will  be  decreased.  This  is  most  often  done  by  safeguarded  quadratic  interpolation.  In  this 
procedure  the  minimizer  tm  of  the  one-dimensional  quadratic  approximation  q  (/ )  to  /  (xc  +  utc )  that  inter¬ 
polates  f(xc),Vf(xt)Tdc  .and  /  (xc  +  tc  dc )  is  calculated  by 

-t}VfcTdc 

"  =  2[/(xc  ■ntdc)-f(xt)-ie  V/te)'*)*  (4-8) 

and  the  next  stepsize  is  then  set  to  max  {tm,cxtc ),  where  typically  c,  =  0.1  .  (It  can  be  shown  that 
tm  >  tc! 2(l-ot)).  This  Imt-Decrease  stage  may  be  repeated  one  or  more  times  if  the  new  xc+tcdc  continues 
to  fail  (4.2.1).  Someumes  a  form  of  safeguarded  cubic  interpolation  is  used  instead  in  this  stage;  see  e  g. 
Dennis  and  Schnabel  [1983]. 

Alternately,  if  rc  =  l  satisfies  (4.2.1)  but  not  (4.2.2),  tc  will  be  increased.  Generally  a  simple  rule  like 
tc=2tc  is  used  although  more  sophisticated  strategies  are  possible.  This  Inn-Increase  stage  also  may  be 


Algorithm  4.2.1  Line  Search  Structure 

Given  /  (x ) ;  R"  -»R,  Xe ,  /  (xc  ),V/  (xc ) ,  descent  direcuon  de 

flow  0 ,  tup  ;=  ■» ,  done  :=  false,  tc  :=  1 
Repeat 

evaluate  /  (Xc  +  rc  dc ) 

If  xe  +  tc  dc  satisfies  (4.2. 1)  then 
evaluate  V/(xc  ♦  tcdc) 

If  Xe  +  rcdc  satisfies  (4.2.2)  Then 
done  :=  true 

Else 

r/ow  ;=  tc 
Lf  tup  =  °o  then 

tc  :=  Inn-Increase  (te) 

Else  ic  ^  Refine  (re ,  tiow ,  tup ) 

Else 

tup  .-tc 

If  tlow= 0  then 

te  :■  Inn-Decrease  (rc) 

Else  tc  :=  Refine  (rc ,  tlow ,  tup ) 

Until  done  *  true 


repeated  i fxe  +  kdc  continues  to  satisfy  (4.2.1)  and  fail  (4.2.2). 


After  one  or  more  repetitions  of  either  the  Init-Increase  or  Init-Decrease  phase,  either  an  admissible 
it  +tedc  is  found,  or  it  must  be  the  case  that  the  last  two  values  of  tc  that  have  been  tried  bracket  an 
acceptable  value  of  t .  That  is,  the  one  with  the  lower  value  of  te ,  ilow  must  satisfy  (4.2.1)  but  not  (4.2.2), 
while  the  one  with  the  higher  value  of  tc,  tup  ,  must  fail  (4.2.1).  In  this  case  an  admissible  tc  must  be  in 
(tlow ,  tup ),  and  it  is  identified  by  the  final.  Refine,  phase  of  the  line  search.  Let  8  =  tup  -  tlow .  A  typical 
Refine  phase  would  calculate 

.  _ s2  v/  plow)7  dc 

m  1  ow  /  ( tup  y-f  (tlow )  -  8  v/  (tlow  y  dc 

the  minimizer  of  the  one  dimensional  quadratic  interpolating  f  (tlow),  Vf  (tlow)T d< ,  and / {tup),  and  then 
set  rc=min/'max/'/w,  tlow+c^}  jup-aS}  where  typically  c  2=0.2.  This  phase  may  also  be  repeated  one  or 
more  times. 

In  theory  it  can  be  shown  that  our  Line  search  terminates  in  a  finite  number  of  iterations.  In  practice, 
very  little  work  usually  is  necessary.  Experience  has  shown  that  line  search  algorithms  with  relatively 
loose  tolerances  generally  produce  algorithms  that  require  fewer  total  number  of  function  and  derivative 
evaluations  to  reach  the  minimizer  than  algorithms  with  tight  tolerances.  Typical  line  search  algorithms  set 
a  ■  10“*  in  (4.2.1),  so  that  virtually  any  decrease  in  / (x)  is  acceptable,  and  P  between  0.7  and  0.9  in 
(4 22),  so  that  only  a  small  decrease  in  the  magnitude  of  the  directional  derivative  is  required.  Due  to 
these  tolerances,  te  *  1  is  admissible  much  of  the  time,  and  when  it  is  not,  generally  one  or  at  most  two 
more  values  of  tc  must  be  attempted.  Thus  the  three  procedures  described  above,  Init-Decrease,  Init- 
Increase,  and  especially  Refine,  are  used  only  infrequently. 

The  above  line  search  is  related  to  the  ones  described  by  Shanno  and  Phua  [1978b],  Fletcher  [1980], 
Dennis  and  Schnabel  [1983],  and  many  other  authors.  Many  line  searches  have  been  proposed,  especially 
variants  that  deal  differently  with  the  case  when  the  Hessian  is  indefinite;  see  e.g.  McCormick  [1977], 
More  and  Sorenson  [1979].  Some  line  searches  only  allow  /eSl  and  only  enforce  (4J2.1);  in  this  case 
Algorithm  4.2.1  is  much  simpler  as  the  Init-Increase  and  Refine  stages  and  the  check  for  (4.2.2)  are  elim¬ 
inated.  The  techniques  of  Shultz,  Schnabel,  and  Byrd  [1985]  show  that  such  a  line  search  still  leads  to  a 
globally  convergent  algorithm,  but  satisfaction  of  the  condition  (3.4.3)  for  positive  definite  secant  updates 
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is  not  guaranteed.  Finally,  some  line  search  algorithms  do  not  start  with  tc=l  if  the  previous  step  was  very 
short;  good  strategies  for  doing  this  are  closely  related  to  the  trust  region  approach  that  we  discuss  next. 

4J  Trust  Region  Methods 

Trust  region  methods  are  the  other  main  group  of  methods  for  ensuring  global  convergence  while 
retaining  fast  local  convergence  in  optimization  algorithms.  The  fundamental  difference  between  line 
search  and  trust  region  methods  is  how  they  combine  the  use  of  the  quadratic  model  with  the  choice  of  the 
step  length.  We  saw  in  Section  4.2  that  in  a  line  search  method,  the  quadratic  model  (Xc  +  d)  given  by 
(4.1.1)  is  used  to  obtain  a  search  direction  dc--Hc~xgc  (or  -(Hc  +Ec)~xgc  if  Hc  is  not  positive  definite), 
and  then  a  steplength  is  chosen.  The  procedure  for  choosing  the  steplength  does  not  make  further  use  of 
the  Hessian  (approximation)  Hc  . 

A  trust  region  method  takes  the  different  philosophy  that  one  first  chooses  a  trial  step  length  Ac  ,  and 
then  uses  the  quadratic  model  to  select  the  best  step  of  (at  most)  this  length  for  the  quadratic  model  by  solv¬ 
ing 

minimize  m<(xc  +  s)=f  M +  Vf (,Xc)T s  +  XsT Hcs  (4.3.1) 

**R* 

subject  to  Hie  II S  Ac 

The  trial  step  length  A*  is  considered  an  estimate  of  how  far  we  trust  the  quadratic  model,  hence  it  is  called 
a  trust  radius  and  the  resultant  method  is  called  a  trust  region  method.  We  will  see  below  that  Ac  is  closely 
related  to  the  length  of  the  successful  step  at  the  previous  iteration,  and  may  be  adjusted  as  the  current 
iteration  proceeds.  First  we  describe  the  solution  to  (4.3.1)  in  Theorem  4.3.1.  An  early  proof  of  much  of 
Theorem  4.3.1  is  given  in  Goldfeldt,  Quandt,  and  Trotter  [1966];  other  seminal  references  include  Gay 
[1981]  and  Sorensen  [1982], 

Theorem  4.3.1  Let  gc  =  V/  (x< )e  R" ,  Hee  R""  symmetric,  Ac  >  0  .  Let  Xje  R  denote  the  smallest  eigen¬ 
value  of  Hc  and  let  vj€R"  denote  the  corresponding  eigenvector.  Then  if  Hc  is  positive  definite  and 
ll//e”*gc  II  £  Ac ,  5e  is  the  unique  solution  to  (4.3.1).  Otherwise  the  solution  to  (4.3.1)  satisfies 

lUcItAc  and 
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(He  +V±I)Sc=-gc 

for  some  p*  £  0  where  Hc  +  p*/  is  at  least  positive  semi-definite.  Furthermore  either  Hc  +  | a*/  is  positive 
definite  and 

Sc=~(Hc  +  iLcI)-'gc  (4.3.2) 

for  the  unique  p*  >  max  (0,-Xi )  for  which  llrc  IfcAc  ,  or  p«  =  -Xi  and 

S'=-(HC  -  hirge  +  cov  .  (4.3.3) 

where  +  denotes  the  Moore-Penrose  pseudouniverse,  and  cue  R  is  chosen  so  that  llrc  H=Ac  •  If  Hc  is  positive 

definite,  the  solution  must  be  given  by  (4.3.2);  the  case  (4.3.3)  only  occurs  if  Hc  is  indefinite  and 

II (He  +  pe/)"1£cll  <  A*  for  all  p*  2  -Xi . 

Theorem  4.3.1  indicates  several  differences  between  the  step  taken  in  line  search  and  trust  region 
methods.  From  (4.3.2)  we  see  that,  even  if  Hc  is  positive  definite,  the  trust  region  step  is  not  always  in  the 
Newton  direction  -Hc~l  V /  (xc)  .  In  fact,  it  is  straightforward  to  show  that  for  all  p>  max  {0,-Xi  ), 
IJ(A/c  +  p/)-‘V/  (xc  )H  is  a  monotonically  decreasing  function  of  p.  Thus  as  Ac-»0,  p-*»,  and 
~iHc  +  p/)~‘V/ (xc)  -»  -V/ (xc )/p*  Therefore  for  small  Ac ,  the  uust  region  step  is  nearly  in  the  steepest 
descent  direction,  while  as  Ac  increases,  the  trust  region  step  approaches  and  ultimately  becomes  the  New¬ 
ton  step  -WC-‘V/  (Xc)  as  long  as  Hc  is  positive  definite.  This  is  depicted  in  Figure  4.3.1. 


Figure  4.3.1  The  trust  region  curve 
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A  second  difference  between  line  search  and  trust  region  methods  is  how  they  deal  with  the  case 
when  Hc  is  not  positive  definite.  Since  the  minimization  of  an  indefinite  or  negative  definite  quadratic 
model  is  not  mathematically  well-posed,  a  line  search  method  must  perturb  a  non-positive  definite  Hessian 
to  be  positive  definite,  as  was  seen  in  Section  4.2.  On  the  other  hand,  the  minimization  of  such  a  model 
within  some  closed  region  is  well  defined  and  reasonable,  and  this  is  what  the  trust  region  method  does. 
Indeed,  even  if  xc  is  a  maximum  or  saddle  point,  (4.3.1)  is  well-defined  with  a  solution  given  by  (4.3.3). 

The  above  discussion  indicates  two  attractive  properties  of  trust  region  methods.  First,  small  steps 
are  in  the  steepest  descent  direction,  the  best  direction  for  sufficiently  small  steps,  while  the  Newton  step  is 
used  when  it  is  within  the  trust  region  and  Hc  is  positive  definite,  thereby  hopefully  preserving  fast  local 
convergence.  We  will  see  that  this  property  is  retained  by  all  the  practical  approximations  to  the  ideal  trust 
region  step  that  are  discussed  later  in  this  section.  Second,  the  trust  region  method  deals  naturally  with 
indefinite  Hessian  matrices.  This  will  be  seen  to  lead  to  stronger  convergence  results  than  were  possible 
for  the  line  search  methods  of  Section  4.2. 

On  the  other  hand,  the  ideal  trust  region  step  described  in  Theorem  4.3.1  is  difficult  to  calculate.  The 
main  difficulty  is  that  there  is  no  closed  formula  that  gives  the  unique  p«  >  max  (0,-A.i)  for  which 
ll(//c  +  Pc/)-1  V/ (xc )I1=AC  .  Instead,  this  p*  must  be  calculated  by  an  iterative  process  with  each  iteration 
requiring  the  Cholesky  factorization  of  a  matrix  of  the  form  Hc  +  p/  .  In  contrast,  the  line  search  metnods 
of  Section  4 2  require  only  one  matrix  factorization  per  iteration.  Furthermore,  it  is  possible  that  the  step 
(4.3.3)  will  be  required  which  necessitates  an  eigenvalue-eigenvector  calculation.  This  case  is  rare,  espe¬ 
cially  in  finite  precision  arithmetic,  but  in  cases  that  are  close  to  this  the  calculation  of  (4.3.2)  becomes 
more  difficult. 

For  these  reasons,  efficient  computational  implementations  of  trust  region  methods  solve  (4.3.1) 
approximately.  Before  we  present  these  approximate  soiuuon  methods,  we  discuss  the  overall  schema  of  a 
trust  region  method  including  the  adjustment  of  the  trust  radius  Ac ,  and  the  convergence  properties  of  a 
method  that  uses  this  schema  while  solving  (4.3.1)  exactly.  This  theorem  will  help  justify  our  continued 
interest  in  these  methods.  We  will  then  see  that  these  convergence  properties  are  retained  when  using  vari¬ 
ous  efficient,  approximate  trust  region  steps  in  place  of  the  exact  solution  to  (4.3.1). 
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After  the  trust  region  sc  is  calculated,  a  trust  region  method  must  evaluate  /  (xc  +  sc )  to  see  whether 
xc  +  sc  is  a  satisfactory  next  iterate.  It  may  not  be  if  the  quadratic  model  does  not  accurately  reflect  /  (x ) 
within  the  trust  region.  In  this  case  the  trust  radius  is  decreased  and  the  trust  region  step  recalculated.  Oth¬ 
erwise,  Xc  +  sc  becomes  the  next  iterate  and  the  new  trust  radius  must  be  calculated.  Such  an  approach  is 
outlined  in  Algorithm  4.3.1. 

The  reader  will  recognize  the  step  acceptance  condition 

actual -reduction  <CL\  (predicted -reduction)  (4.3.4) 

as  being  very  similar  to  the  sufficient  decrease  condition  (4.2.1)  used  in  line  search  algorithms.  The  only 

difference  is  that  the  second  order  term  ^sjHcsc  is  included  on  the  right  hand  side  of  (4.3.4).  Again, 

a^lO"4  is  typical.  No  analog  to  condition  (4.2.2)  is  needed  by  trust  region  methods  because  the  strategy 
for  adjusting  the  trust  radius  prevents  the  step  from  being  too  short 

If  (4.3.4)  is  failed,  the  current  iteration  is  repeated  with  a  smaller  trust  radius.  The  procedure  for 
decreasing  Ac  is  similar  or  identical  to  the  procedure  Init- Decrease  for  decreasing  te  in  a  line  search. 


Algorithm  423.1  Trust  Region  Iteration 

Given  /  :  R*  -»R,  Xc ,  /  (Xc ),  V/  (xc ) ,  Hessian  (approximation)  Hc  , 
trust  radius  Ac  >  0,0<cti<a2<l ,  O<C1<C2<I<C3<C4 

done  :=  false 
Repeat 

se  :=  exact  or  approximate  solution  to  (4.3.1) 
evaluate  f  (xc  +sc) 

actual  -reduction  :=  /  (xc  +  sc )  -  /  (xc ) 
predicted-reduction  :=  V/  (Xc)T se  +  ^-sjHcsc 

If  actual -reduction  S  oti  predicted -reduction  Then 

done  :=  true 
x*  :=  xc  +  sc 

If  actual  -reduction  <  ct2  predicted  -reduction  Then 

A*  :=  Increase- A  (Ae)  (*  A*mem  [C3,  CiJA*  *) 

Else  A*  :=  Ac  (*orA*mem(ci,  l]Ac  *) 

Else  Ac  :=  Decrease-A  (Ac )  (*  new  Ac  mem  (c  1,  C2]oId  Ac  *) 

Until  done  =  true 


Typically,  quadratic  interpolation  (equation  (4.2.8))  with  a  safeguard  such  as  new  Ac  6  [0.1,  0.5]  old  Ac  is 
used. 

If  (4.3.4)  is  satisfied,  Xe  +  sc  is  acceptable  as  the  next  iterate  x+  and  the  new  trust  radius  A*  must  be 
determined.  Algorithm  4.3.1  increases  A+  over  A:  if  the  quadratic  model  and  function  have  agreed  quite 
well;  the  condition  actual -reduction  £  02  (predict -reduction)  tests  this  with  a  2=0. 75  typical.  Some 
methods  instead  allow  the  current  iteration  to  be  continued  with  this  larger  trust  radius  in  this  case,  which 
complicates  the  description  considerably  and  doesn’t  affect  the  theoretical  properties.  This  may  help  in 
practice  by  saving  gradient  evaluations.  In  either  case,  the  Increase -A  procedure  usually  doubles  Ac ,  analo¬ 
gous  to  the  Init-Increase  portion  of  the  line  search.  Otherwise,  A»=A^  in  Algorithm  4.3.1.  Some  methods 
may  set  the  new  A+  <  Ac  if  agreement  between  the  actual  function  and  the  model  was  rather  poor,  for 
example  if  (4.3.4)  was  failed  with  a=0.1.  This  also  does  not  affect  the  method’s  convergence  properties. 

Theorem  4.3.2  gives  the  main  global  and  local  convergence  properties  of  a  method  based  on  the  trust 
region  iteration  of  Algorithm  4.3.1,  in  the  case  where  (4.3.1)  is  solved  exactly.  Similar  results  may  be 
found  in  many  references  including  Fletcher  [1980],  Gay  [1981],  and  Sorensen  [1982]. 

Theorem  4 .3.2  Let  /  :  R"  -*  R  be  twice  continuously  differentiable  and  bounded  below.  Also,  for  xoe  R" 
and  some  Pi,  P2  >  0,  let  V2/ (x)  be  uniformly  continuous  and  satisfy  HV2/ (x  )ll  <  Pi  for  all 
xe  (xe R*  :  / (x)  5/ (xo)J  .  Let  (x*  ]  be  the  sequence  produced  by  iterating  Algorithm  4.3.1  starting  from 
xo  .  and  using  Hc  =  V2/  (Xe )  or  any  symmetric  approximation  with  II Hc  II  <  p2  at  each  iteration,  and  the  exact 
solution  to  (4.3.1)  to  calculate  dc .  Then  ^limJIV/ (x*)lfc=0  .  If  in  addition  each  Hc=V*f  (xe) ,  then  for  any 

limit  point  x.  of  the  sequence  (x* ) ,  V/  (x.  )=0  and  V2/  (x. )  is  at  least  positive  semi-definite.  Furthermore 
if  each  Hc=V1f(xe),thcn  if  (x* }  converges  to  x«,  V2/  (x»)  is  positive  definite,  and  V2/  (x)  is  Lipschitz 
continuous  around  x. ,  then  the  rate  of  convergence  is  q  -quadratic. 

Theorem  4.3.2  shows  that  a  trust  region  method  that  solves  the  trust  region  problem  exactly  has 
attractive  convergence  properties.  The  same  first  order  result  is  established  as  for  line  search  methods,  and 
no  assumption  about  the  condition  number  of  Hc  is  needed.  In  addition,  if  exact  second  derivatives  are 
used,  then  the  second  order  necessary  conditions  for  a  minimum  are  satisfied  by  any  limit  point,  which 
means  that  saddle  points  and  maxima  arc  avoided.  The  analysis  also  shows  that  near  a  local  minimum  wuh 


54 


V2/  (x- )  positive  definite,  asymptotically  the  trust  region  constraint  becomes  inactive  so  that  only  Newton 
steps  are  taken  and  local  quadratic  convergence  is  retained. 

These  nice  convergence  properties  help  motivate  the  interest  in  trust  region  methods  which  follow 
the  general  schema  of  Algorithm  4.3.1,  but  where  the  step  is  calculated  by  a  considerably  more  efficient 
procedure  than  that  required  for  the  exact  solution  of  problem  (4.3.1).  There  are  two  obvious  relaxations  to 
problem  (4.3.1)  that  may  make  it  easier  to  solve  :  the  trust  region  constraint  may  be  satisfied  only  approxi¬ 
mately,  and/or  the  quadratic  model  may  be  minimized  only  approximately.  We  will  see  that  both  may  be 
relaxed  substantially  without  weakening  the  theoretical  properties  of  the  method. 

In  fact,  two  general  classes  of  efficient  methods  for  approximately  solving  (4.3.1)  have  arisen, 
corresponding  to  these  two  possible  relaxations  of  (4.3.1).  In  the  first,  approximate  optimal  step  methods, 
mainly  the  trust  region  constraint  is  relaxed;  the  step  generally  is  still  of  the  form  ~{HC  +  p/)_1V/  (xc)  for 
some  positive  definite  Hc  +  \xl .  These  methods  are  summarized  below.  In  the  second,  dogleg  methods,  the 
minimization  of  the  quadratic  model  is  relaxed;  we  defer  the  consideration  of  these  methods  until  later  in 
this  section. 

Hebden  [1973]  and  More  [1978]  were  the  first  to  construct  efficient  approximate  optimal  step 
methods.  They  developed  an  efficient  procedure  for  finding  a  p  >  0  for  which 

ll(//c+p/)->V/(xf)ll«Ac  (4.3.5) 

in  the  case  when  Hc  is  positive  definite  and  ll//e-1V/  (xc  )l!  >  Ac  .  Their  algorithms  are  based  on  applying 

Newton’s  method  in  p  to 


following  the  observation  that  (4.3.6)  is  more  nearly  linear  in  p  near  the  desired  solution  than  is  (4.3.5). 
This  results  in  the  p-iteration 


(4-3.7) 


IM2(IM-Ac)  a ,  - 

^  ~  4  Ac  (Jc'(//e  +p/)->je)  (4-3-7) 

where  sc  =  ~{HC  +  p/)_l  V/  (xc),  which  requires  one  factorization  of  a  matrix  of  the  form  Hc  +\xl  for  each 


p-iteralion.  Typically  the  trust  region  constraint  is  relaxed  to  llrc  lie  [0.9,  1.1]  Ac  ;  then  usually  only  one  or 


two  iterations  of  (4.3.7),  and  the  same  number  of  matrix  factorizations,  arc  required  for  each  iteration  of 


the  optimization  algorithm. 
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Several  authors,  including  Gay  [1981],  Sorensen  [1982],  and  More  and  Sorensen  [1983],  have  inves¬ 
tigated  generalizations  of  these  approximate  optimal  step  methods  that  extend  to  the  case  when  Hc  is 
indefinite.  More  and  Sorensen  present  an  efficient  algorithm  that  guarantees  that  their  approximate  solu¬ 
tion  to  (4.3.1)  reduces  the  quadratic  model  md Xe  +  s)  by  at  least  y  times  the  amount  that  the  exact  solution 
to  (4.3.1)  would,  for  any  fixed  y  <  1.  Their  algorithm  combines  the  Hebden-More  procedure  mentioned 
above  with  a  use  of  the  LINPACK  condition  number  estimator  to  obtain  a  satisfactory  solution  when  the 
exact  solution  is  in  or  near  the  "hard  case"  (4.3.3).  No  eigenvalue/eigenvector  calculations  are  required. 

They  show  that  their  method  still  generally  requires  only  1-2  matrix  factorizations  per  iteration,  and  that  it 
retains  the  convergence  properties  of  Theorem  4.3.2. 

The  analyses  of  these  approximate  optimal  step  methods  are  subsumed  by  Theorem  4.3.3  below,  due 
to  Shultz,  Schnabel,  and  Byrd  [1985].  Similar  first  order  results  are  proven  by  Powell  [1970,  1975]  and 
Thomas  [1975].  Details  of  the  interpretations  that  are  given  following  Theorem  4.3.3  are  also  found  in 
Shultz,  Schnabel,  and  Byrd  [1985]. 

Theorem  4.3.3  Let  the  assumptions  in  the  first  two  sentences  of  Theorem  4.3.2  hold.  Let  [x* )  be  the 
sequence  produced  by  iterating  Algorithm  4.3.1  starting  from  x<>  ,  using  //C=V2/  (xc)  or  any  symmetric 
approximation  with  II Hc  II  £  p2  at  each  iteration.  If  there  exist  c  i ,  c  2  >  0  such  that  each  sc  satisfies, 

V/fXcfsc  +  ^ slH'S e  £- dllV/fXc)!! min  [Ac,  C2"^)!l  ]  ,  (4.3.8) 

then  jimllV/  (xt)ll=0  .  If  in  addition  each  Hc=V*f  (xc )  and  there  exists  c  3  >  0  such  that  each  dc  satisfies 

(*c  )T se  +  \sjHc  se  £  -c  s(-*.i  (Hc  ))Ac2  (4.3.9) 

where  denotes  the  smallest  eigenvalue  of  He  ,  then  for  any  limit  point  x.  of  (x* }  ,  V/ (x.  )=0  and 

Vyfx.)  is  at  least  positive  semi-definite.  Also,  if  each  //e=V2/ (Xc),  each  se  satisfies  (4.3.8),  and  there 
exists  C4  e  (0, 1]  such  that  rc=-/fc~lV/ (xc)  whenever  Hc  is  positive  definite  and  (xc)ll  <  c4Ac  , 

then  if  [x* }  converges  to  x.  with  V2/  (x. )  positive  definite  and  V2/  (x)  Lipschitz  continuous  around  x. , 
then  the  rate  of  convergence  is  q  -quadratic. 

Theorem  4.3.3  contains  two  equations,  (4.3.8)  and  (4.3.9),  that  say  what  conditions  a  trust  region  step 
needs  to  satisfy  in  order  to  be  globally  convergent  to  points  satisfying  the  first  and  second  order  necessary 
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conditions  for  minimization,  respectively.  Equation  (4.3.8)  gives  a  condition  on  the  sufficient  use  of  des¬ 
cent  directions  in  order  to  assure  first  order  global  convergence.  With  Ci=y  and  C2=l,  it  implies  that  the 

step  se  provides  at  least  as  much  decrease  in  the  quadratic  model  as  the  best  permissible  step  in  the  steepest 
descent  direction.  (Here  "permissible"  means  "obeying  the  trust  region  constraint").  This  interpretation 
forms  pan  of  the  motivation  for  the  dogleg  methods  for  approximating  (4.3.1)  that  are  discussed  below.  By 
using  smaller  ci  and  a  in  (4.3.8),  condition  (4.3.8)  says  that  each  se  provides  at  least  a  fixed  fraction  of  the 
decrease  in  the  quadratic  model  that  would  be  obtained  by  the  best  permissible  step  in  some  direction  pc  of 
"bounded"  descent  (i.e.  the  angle  between  pe  and  V/  (x£)  is  uniformly  bounded  above  by  a  constraint  < 
90°).  This  is  all  that  is  needed  to  assure  jim  (V/ (x*))=0  .  Thus  Theorem  4.3.3  also  applies  to  a  line  search 

method  where  the  trust  region  bound  Ar  is  used  to  determine  the  steplength. 

Equation  (4.3.9)  gives  a  condition  on  the  sufficient  use  of  negative  curvature  directions  that,  in  con¬ 
junction  with  (4.3.8),  assures  global  convergence  to  a  point  satisfying  second  order  necessary  conditions 
for  minimization.  Equation  (4.3.9)  can  be  interpreted  as  saying  that,  whenever  Hc  is  indefinite.  sc  provides 
at  least  a  fixed  fraction  of  the  reduction  in  the  quadratic  model  that  would  be  obtained  by  the  best  permissi¬ 
ble  step  in  some  direction  vc  of  "bounded"  negative  curvature  (i.e.  ( vjHc  vc  )/(X,  vcrvc )  is  uniformly  bounded 
below  by  a  fixed  positive  constant).  This  constitutes  a  considerable  relaxation  of  the  "hard  case"  (4.3.3)  in 
the  exact  solution  of  (4.3.1). 

Thus  any  trust  region  method  that  uses  the  schema  of  Algorithm  4.3.1  and  chooses  its  steps  to  satisfy 
conditions  (4.3.8-9)  has  strong  global  convergence  properties.  It  is  straightforward  to  show  that  (4.3. 8-9) 
implies  the  condition  on  a  trust  region  step  used  by  More  and  Sorensen  [1983],  that  the  reduction  in  the 
quadratic  model  by  je  be  at  least  a  fixed  fraction  of  the  reduction  from  solving  (4.3.1)  exactly.  The  con¬ 
verse  is  only  true  under  stronger  assumptions  on  Hc  (see  Byrd,  Schnabel,  and  Shultz  [1987]).  Now  we 
return  to  the  second  class  of  methods  for  approximately  solving  (4.3.1),  dogleg  methods,  which  arc  now 
easily  seen  as  another  way  to  satisfy  conditions  (4.3.8-9). 

The  earliest  trust  region  method,  of  Powell  [1970],  is  the  original  dogleg  method.  Given  a  quadratic 
model  me(xe  +  s)  with  Hc  positive  definite,  and  >  0,  it  selects  xc  +  sc  to  be  the  Newton  point  x*  = 
Xc  -  He~'Vf  (xc )  if  it  is  inside  the  trust  region.  Otherwise  it  selects  xc  +  sc  to  be  the  point  on  the  piecewise 


linear  curve  connecting  xe ,  xcP ,  and  xn  which  is  distance  A«  from  xc.  (See  Fig.  4.3.2.)  Here  xcf  is  the 
Cauchy  point,  the  minimum  of  the  quadratic  model  in  the  steepest  descent  direction  -V/  (xc)  from  xc .  (It 
is  straightforward  to  show  that  ILcc,  -  xc  II  <  11%  -  xc  II.  so  that  the  intersection  of  the  dogleg  curve  with  the 
trust  radius  is  unique.)  Dennis  and  Mei  [1979]  constructed  a  generalization,  the  double  dogleg  method, 
which  selects  xc  +  sc  to  be  the  unique  point  on  the  piecewise  linear  curve  connecting  Xc ,  xcp ,  yx# ,  and  xn 
which  is  distance  Ac  from  xe  ,  or  xc  +  rc  =  xn  if  xn  is  inside  the  trust  region,  for  a  particular  y  <  1 .  For 
either  method,  it  can  be  shown  that  as  one  moves  along  the  piecewise  linear  curve  from  xc  to  x* ,  the  dis¬ 
tance  from  Xc  increases  and  the  value  of  the  quadratic  model  decreases. 

Thus  these  dogleg  methods  take  small  steps  in  the  steepest  descent  direction  when  the  trust  radius  i" 
small,  take  Newton  steps  when  the  trust  radius  is  sufficiently  large,  and  take  steps  in  a  linear  combination 
of  the  steepest  descent  and  Newton  directions  otherwise.  Due  to  the  use  of  steepest  descent  steps  when  Ac 
<,  llxcp  -xc  II  and  to  the  monotonic  decrease  of  the  quadratic  model  along  the  entire  dogleg  curve,  they 
always  obtain  at  least  as  much  descent  on  the  quadratic  model  as  the  best  steepest  descent  step  of  length  at 
most  Ac.  Thus  they  obey  (4.3.8)  and  by  Theorem  4.3.3  are  globally  convergent  in  the  sense  that  the 
sequence  of  gradients  of  the  iterates  converges  to  zero. 


-V/(xe) 


Figure  4.3.2  The  dogleg  curve 
(dotted  pan  is  double  dogleg  modification) 


An  attraction  of  these  dogleg  methods,  in  comparison  to  the  approximate  optimal  step  methods  dis¬ 
cussed  above,  is  that  they  only  require  one  matrix  factorization  (of  Hc)  per  iteration.  All  the  remaining  cal¬ 
culations  are  easily  seen  to  require  at  most  0  (n2)  operations,  and  no  additional  factorizations  are  required 
if  the  trust  region  must  be  decreased  during  the  current  iteration.  On  the  other  hand,  neither  version 
described  above  makes  explicit  use  of  an  indefinite  Hc  or  satisfies  (4.3.9).  Thus  no  second  order  global 
convergence  results  can  be  proven  foi  them. 

These  dogleg  methods  are  closely  related  to  minimizing  the  quadratic  model  rrtc  (xc  +  sc )  over  the 
two  dimensional  subspace  spanned  by  the  steepest  descent  and  Newton  directions,  subject  to  the  trust 
region  constraint  This  two  dimensional  trust  region  problem  also  is  easy  to  solve  using  oniy  the  factoriza¬ 
tion  of  Hc,  and  the  inclusion  of  the  steepest  descent  direction  assures  satisfaction  of  (4.3.8)  and  hence  first 
order  global  convergence.  Shultz,  Schnabel,  and  Byrd  [1985]  propose  an  algorithm  along  these  lines 
where,  if  Hc  is  indefinite,  the  two  dimensional  subspace  is  changed  to  the  one  spanned  by  -V/  (xc)  and 
some  direction  of  bounded  negative  curvature  which  is  fairly  efficient  to  compute.  Thus  the  algorithm 
obeys  (4.3.9)  as  well  and  has  the  same  theoretical  convergence  properties  as  an  approximate  optimal  step 
method.  Byrd,  Schnabel,  and  Shultz  show  that  an  optimization  algorithm  based  on  this  approach  is  very 
competitive  with  a  modem  approximate  optimal  step  method  in  robustness  and  efficiency. 

In  practice  both  approximate  optimal  step  methods  and  various  dogleg  methods  are  used  in  solving 
unconstrained  optimization  problems  and  in  other  contexts.  Some  additional  comments  on  their  relative 
merits  are  contained  in  Section  4.4. 

4.4  Comparison  of  Line  Search  and  Trust  Region  Methods 

Sections  42  and  4.3  have  presented  two  classes  of  methods,  line  searches  and  trust  regions,  for 
obtaining  globally  convergent  unconstrained  optimization  methods  while  also  retaining  fast  local  conver¬ 
gence.  The  reasons  for  presenting  both  approaches  arc  that  neither  appears  to  be  consistently  superior  to 
the  other  in  practice,  that  both  are  used  in  modem  software,  and  that  both  can  be  expected  to  play  important 
roles  in  the  future  development  of  optimization  methods.  This  section  elaborates  briefly  on  these  remarks. 
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Gay  [1983]  and  Schnabel,  Koontz,  and  Weiss  [1985]  have  conducted  comparisons  of  line  search  and 
trust  region  codes  on  a  standard  set  of  test  problems  from  More,  Garbow,  and  Hillstrom  [1981].  Gay  com¬ 
pared  a  BFGS  method  utilizing  a  line  search  with  a  BFGS  method  using  a  double-dogleg  trust  region  step. 
Schnabel,  Koontz,  and  Weiss  tested  methods  using  finite  difference  Hessians  and  methods  using  BFGS 
approximations,  in  both  cases  comparing  line  search,  double-dogleg,  and  approximate  optimal  trust  region 
steps.  Both  studies  showed  that  while  there  can  be  considerable  differences  between  the  performance  of 
line  search,  dogleg,  and  optimal  step  methods  on  individual  problems,  no  one  method  is  consistently  more 
efficient  or  robust  than  any  other.  Indeed,  the  average  differences  in  efficiency  between  the  line  search  and 
trust  region  methods  tested  were  quite  small,  and  they  had  similar  success  rates. 

In  modem  numerical  software  libraries,  one  finds  both  line  searches  and  trust  regions  used  in  con¬ 
junction  with  both  (finite  difference)  Hessians  or  BFGS  approximations.  Philosophically,  some  people 
prefer  to  use  line  searches  in  conjunction  with  BFGS  methods  because  the  necessary  condition  (3.4.3)  for 
positive  definite  updates  can  be  guaranteed  to  be  satisfied  at  each  iteration;  in  trust  region  methods  no  such 
guarantee  is  possible  and  occasionally  (3.4.3)  is  not  satisfied  and  the  update  must  be  skipped.  Similarly, 
some  people  advocate  using  trust  region  methods  in  codes  where  the  (finite  difference)  Hessian  is  used, 
because  a  "natural"  treatment  of  indefiniteness  is  possible  and  it  can  be  guaranteed  that  saddle  points  and 
local  maxima  are  avoided.  But  as  the  previous  paragraph  has  indicated,  neither  of  these  theoretical  argu¬ 
ments  has  been  shown  to  correspond  to  any  significant  computational  advantage. 

Algorithm  developers  in  areas  such  as  constrained  optimization,  least  squares  data  fitting,  and  large 
scale  optimization  often  need  to  choose  between  using  line  searches  or  trust  regions  in  developing  new 
codes.  Some  trade-offs  are  fairly  consistent.  Line  searches  often  are  a  little  simpler  to  code,  but  sometimes 
it  is  not  clear  how  to  deal  with  indefiniteness,  rank  deficiency,  or  other  factors  that  may  cause  the  line 
search  direction  to  be  in  an  unacceptable  direction.  Trust  region  methods  often  offer  a  mathematical  solu¬ 
tion  to  these  problems,  but  usually  require  some  additional  linear  algebra  cost.  In  addition  it  someumes  is 
challenging  to  construct  efficient,  approximate  solution  algorithms  for  the  appropriate  trust  region  problem. 
The  result  is  that  both  approaches  arc  used;  for  example  there  currently  is  considerable  research  activity  in 
both  line  search  and  trust  region  methods  for  nonlincarly  constrained  optimization.  The  two-dimensional 
trust  region  technique  mentioned  at  the  end  of  Section  4.3  seems  to  offer  a  good  compromise  in  cases 
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where  the  trust  region  approach  seems  desirable  to  assure  acceptable  step  directions,  but  an  (approximate) 
optimal  solution  to  the  exact  trust  region  problem  is  very  difficult  or  expensive  to  find 

One  case  where  there  appears  to  be  a  discernible  difference  between  line  search  and  trust  region 
methods  is  in  Gauss-Newton  methods  for  nonlinear  least  squares  (see  Section  6.2).  In  this  case  the  under¬ 
lying  local  method  is  at  best  linearly  convergent  on  most  problems.  For  such  algorithms,  trust  region  algo¬ 
rithms,  which  may  be  viewed  as  combining  two  linearly  convergent  directions,  the  standard  Gauss-Newton 
direction  and  the  steepest  descent  direction,  appear  generally  to  be  more  robust  and  efficient  than  line 
search  algorithms  which  only  use  the  Gauss-Newton  direction.  For  a  detailed  discussion  of  such  algo¬ 
rithms,  see  Dennis  and  Schnabel  [1983],  Ch.  10. 
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5.  Non-Taylor  Scries  Methods 

This  section  presents  two  fundamentally  different  algorithmic  approaches  that  have  proven  them¬ 
selves  useful  for  unconstrained  minimization.  First,  we  will  describe  the  Nelder-Mead  simplex  algorithm. 
Nelder  and  Mead  [1965],  an  effective  pattern  search  technique  for  problems  of  very  low  dimension  which 
is  beloved  of  users  but  generally  ignored  by  optimization  researchers.  Then,  we  will  provide  a  unifying 
framework  for  the  proliferation  of  conjugate  direction  algorithms  that  have  been  devised  for  solving  prob¬ 
lems  with  large  numbers  of  variables. 

5.1  The  Nelder-Mead  Simplex  Algorithm 

The  Nelder-Mead  algorithm  moves  a  simplex  through  the  domain  space  with  the  goal  of  getting  the 
function  minimizer  x.  in  the  interior  of  the  simplex.  Once  ad  hoc  tests  indicate  that  the  minimizer  has  been 
surrounded,  the  algorithm  shrinks  the  simplex  toward  the  vertex  corresponding  to  the  lowest  function  value 
and  returns  to  the  process  of  trying  to  get  x.  into  the  interior  of  the  (smaller)  simplex. 

We  will  confine  ourselves  here  to  a  description  of  the  four  basic  moves  of  the  algorithm.  Each  itera¬ 
tion  begins  with  a  set  of  n  + 1  current  points  x*1 . in  general  position,  i.e.,  the  convex  hull  Sc  of 

(xt’,...,x*+1)  is  an  n -dimensional  simplex.  Furthermore,  these  vertices  are  assumed  to  be  sorted  on  their 

objective  function  values  so  that  /  (xi)if  (xi*1),  i  =  1 . n.  The  first  goal  of  each  iteration  is  to  replace 

the  worst  vertex  Xc"+1  with  a  better  one  by  moving  the  simplex  away  from  x?*1.  If  this  fails,  we  tacitly 
assume  that  we  arc  close  enough  to  the  minimizer  to  need  smaller  moves  for  improvement-  Thus,  we  keep 
the  best  vertex  and  shrink  the  simplex  along  each  edge  by  replacing  each  of  the  other  vertices  by  its  aver¬ 
age  with  the  old  best  vertex.  We  then  evaluate  /  at  the  n  new  vertices  and  sort  and  label  them  to  obtain 

<xl  .....xj*1  >.  The  convention  we  use  is  that  the  older  vertex  is  numbered  lower  when  two  vertices 
have  equal  function  values. 

We  have  described  the  shrinkage  step  through  which  the  algorithm  tries  to  close  in  on  x.;  now  wc 
describe  the  moves  aimed  at  getting  the  larger  simplex  close  by  moving  away  from  the  worst  vertex. 

The  first  trial  step  in  each  iteration  is  to  consider  the  reflection  x'c  -  2xc  -xc**‘  of  x^*1  through  the 

centroid  £  =  —  fxj  of  the  best  a -face  of  5C .  If  success  is  so  great  that  /  (xcr)  </(xc;),  then  we  try 
n  ,T! 
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expanding  the  simplex  in  the  same  direction  by  testing  / (x,.*)  < /  (xc’),  where  xt  =  2x/  -x"*1.  If  the  expan¬ 
sion  is  successful,  we  drop  x"'*1  in  favor  of  x/.  otherwise  we  drop  xc"*'  in  favor  of  x'\  and  sort  the  vertices 
to  obtain  5*.  There  are  two  ihings  to  note  about  a  successful  expansion  step.  The  first  is  that  we  do  not 
continue  in  the  same  vein  even  if  /  (x/)  <  /  (x/)  <  /  (Xe1).  This  is  because  we  want  to  avoid  a  simplex  that 
gets  elongated  enough  that  its  vertices  are  not  numerically  in  general  position.  The  second  point  is  that  we 
accept  the  expansion  vertex  even  if  it  is  not  as  good  as  the  reflection  vertex.  Thus,  the  best  approximate 
minimizer  we  have  found  so  far  might  not  be  retained  as  a  vertex.  This  is  in  keeping  with  our  view  of  try¬ 
ing  to  move  the  simplex  over  the  minimizer  before  we  start  to  close  in  on  it 

The  reflection  vertex  is  taken  without  an  expansion  attempt  if  /  (Xc')<,f  (x/)  < /  (Xc*).  In  this  case,  xc" 
will  become  xj*1  after  sorting.  If  /  (x/)2/  (x/*1),  then  we  try  once  to  contract  the  simplex  internally 
along  the  reflection  direction  by  testing/ (xcc)  < / (xc*).  where  the  contraction  vertex  is  xf  =  -j [xc  ^x"*1].  If 

the  contraction  fails,  then  we  shrink  the  simplex  as  explained  above.  If  it  succeeds,  then  we  replace  x*-1 
by  x/  and  son  to  prepare  for  the  next  iteration. 

There  is  one  more  possibility  to  consider  for  the  trial  of  a  reflection  step.  If  /  (xc*)  <  f  (xe )  <  /  (xc* 
then  we  can  see  immediately  that  if  we  replace  xc"*‘  by  xcr,  the  outcome  would  be  that  xj*1  would  be  x/ 
and  that  the  subsequent  reflection  step  would  be  rejected,  because  xi  =xc**1,  in  favor  of  a  trial  contraction. 
Thus,  we  pass  over  this  ‘shadow'  iteration  and  compute  the  indicated  shadow  contraction  directly  from  the 

current  vertices  as  x’’  -  ^-(x/-t- Jc ).  We  finish  the  iteration  exactly  as  we  did  for  a  regular  contraction. 

Many  users  have  suggested  minor  modifications  of  this  algorithm.  The  best  known  modifications 
and  the  history  of  the  algorithm  are  collected  in  Woods  [1985].  Woods  also  gives  a  novel  application  of 
one  of  the  three  major  advantages  of  the  algorithm.  He  applies  the  algorithm  to  multicriteria  optimization 
by  exploiting  the  fact  that  we  only  use  the  objective  function  /  to  decide  which  is  the  ‘better’  of  two  ver¬ 
tices.  In  common  use.  this  indicates  that  the  algorithm  should  be  robust  with  respect  to  noise  or  inaccura¬ 
cies  in  the  values  of  / .  For  example,  we  have  experience  with  an  engineer  who  was  able  ic  use  it  to 
resolve  parameters  to  .5%  after  only  reaching  5%  resolution  with  a  standard  finite-difference  Newton  code. 

The  three  main  strengths  of  the  Neldcr-Mcad  simplex  method  are  its  tolerance  of  function  noise,  its 
nonreliance  on  any  gradient  approximations,  and  the  extreme  simplicity  of  its  implementation.  It  takes  less 


than  100  statements  to  implement  in  any  high  level  language,  and  this  is  an  important  factor  in  its  popular¬ 
ity  with  users  who  still  distrust  black  boxes. 

The  major  weaknesses  of  the  algorithm  are  that  it  can  be  very  slow  for  more  than  about  5  variables, 
and  that  it  can  converge  to  a  nonminimizer.  The  only  example  of  the  latter  that  we  know  of  is  given  in 
Dennis  and  Woods  [1987]  and  it  is  mitigated  somewhat  by  the  nonminimizing  limit  being  a  point  where  the 
derivative  doesn't  exist.  Both  these  shortcomings  are  the  subject  of  current  research. 

It  is  worth  noting  that  the  example  referred  to  in  the  paragraph  above  shows  that  the  algorithm  is  not 
safe  for  nonsmooth  problems  despite  the  fact  that  it  uses  no  gradient  information.  Incidental  to  this  caveat 
is  that  the  example  is  convex. 

S2  Conjugate  Direction  Methods 

The  second  class  of  methods  that  we  discuss  in  this  section  is  the  conjugate  direction  algorithms.  It 
is  not  entirely  accurate  to  say  that  they  are  not  based  on  quadratic  models,  but  there  is  never  a  need  to  store 
a  full  Hessian.  Therefore,  these  methods  are  especially  suited  for  large  dimensional  problems  where  /  (x ) 
and  g  (x )  are  available  but  n  is  too  large  to  store  or  factor  an  n  xn  matrix  at  each  iteration. 

Conjugate  direction  algorithms  are  usually  presented  as  they  would  be  programmed.  This  demon¬ 
strates  their  most  important  property,  computational  simplicity  and  little  storage,  but  it  obscures  the 
geometric  elegance  that  better  helps  the  novice  gain  an  overview  of  the  whole  class  of  methods. 

Conjugate  direction  methods  are  most  simply  presented  as  methods  for  minimizing  strictly  convex 
quadratic  functions.  We  will  follow  the  point  of  view  taken  in  Dennis  and  Turner  [1986]  where  the  reader 
can  find  all  the  proofs  of  results  claimed  here.  We  will  adopt  the  standard  convenience  of  taking  x0=0. 

Assume  that  we  are  at  the  t1*  iterate  and  that  we  have  a  scheme  for  generating  a  descent  direction 

for  q(x)=^xTHx~hTx,  H  symmetric  and  positive  definite,  from  x*.  Suppose  that  x*  minimizes 
q  (x)  on  a  it -dimensional  subspace  spanned  by  the  previous  iterates.  Choose  x*+l  to  be  the  unique  minim- 
izer  of  q(x)  on  the  k  +  1-dimensional  subspace  formed  by  adding  d»*i  to  the  previous  spanning  set.  These 
two  sentences  characterize  the  conjugate  direction  algorithms.  It  is  interesting  to  note  that  this  point  of 
view  is  identical  to  the  definition  given  by  Cantrell  [1969],  Cragg  and  Levy  [1969],  and  Mielc  and  Cantrell 


[1969]  of  a  ‘memory  gradient’  method.  It  is  completely  developed  in  Dennis  and  Turner  [1986].  Nazareth 
[1986]  gives  a  corresponding  algorithm  for  general  minimization  problems. 


It  is  easy  to  show  that  if  we  define  at  each  iteration  p;  =x,  then  pjHp ,  =  0  for  \<,j  <i  <,k  and 

pfVq(xk)= 0  for  1  <j<k.  Thus,  solving  the  k  + 1  dimensional  minimization  problem  on  span 

[p . . Pk,dk* i]  for  z*+i  requires  the  solution  of  a  (k  +  l>dimensional  symmetric  positive  definite  linear 

system  which  is  diagonal  except  for  a  full  last  row  and  column.  This  system  can  be  solved  explicitly,  but 
the  expense  of  saving  all  the  previous  pi ’s  is  too  high  for  large  problems. 

If  one  can  arrange  to  have  dlHpl  -  0  for  the  oldest  p, ’s,  then  those  pt  do  not  have  to  be  saved 
because  they  are  not  needed  explicitly  in  solving  for  zt+i.  It  turns  out  that  dk+i--Vq(xk)=h  -Hxk 
accomplishes  this  purpose  for  1  Si  <,k  - 1  so  that  pk*\  is  a  linear  combination  of  p*  and  ck*\.  This  for¬ 
tunate  circumstance  follows  from  a  more  general  result. 

Let  B  be  an  arbitrary  matrix  and  select  each  d/+!  e  span  ( di,Bp\,...,Bpj }.  Then,  it  is  easy  to  show 

that  xk  minimizes  q(x )  on  span  [duBdi . Bk~xd\)  mK(d\,B and  that  Vq (xk )TBp;  =  0  for 

j  - 1 . k  - 1.  Thus,  if  we  choose  any  B  and  generate  dk+ j  so  that  dJ+\H  -~Vq  (xk  )TB ,  then  we  only  need 

xk,pk  and  dk^  to  generate 

Pk*\  =  d*+i  +  P*  p*,  (5.2.1) 

and 


x*+i  =  x*  +  a*  pt+i,  a*  =  -hr  u}X~  ■  (5.2.2) 

a^i«p*+i 

In  this  case,  the  pk's  are  to  be  thought  of  as  directions  rather  than  iterative  steps  as  they  were  defined 
above,  but  there  is  no  difficulty  introduced  into  the  discussion  above  by  doing  so.  Of  course  ±d*+]  must 
also  be  a  descent  direction  for  q  from  xk,  i.e.,  dl V<7(xt)*0.  The  choice  B=H  gives  rise  to 
d**i  =  - (x*)  mentioned  above.  This  is  called  the  conjugate  gradient  algorithm.  The  subspace 
K{d\,B , Jk — 1)  is  called  the  k -dimensional  Krylov  subspace  generated  from  d\  by  B . 


If  M  is  any  symmetric  positive  definite  matrix  for  which  dk+\  =  -M-]Vq(xk)  is  easy  to  compute, 
then  dk* i  is  a  descent  direction  and  corresponds  to  B  =M~lH .  The  resultant  method  is  called  the  precondi¬ 
tioned  conjugate  gradient  algorithm  and  M  is  called  the  preconditioner.  The  use  of  a  preconditioner  can 
significantly  improve  the  conjugate  gradient  algorithm.  Let  us  now  discuss  briefly  some  factors  in 
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choosing  M . 

Intuitively,  there  are  two  things  we  might  want  to  accomplish  in  oar  choice  of  M  and  hence  B .  We 
can  try  for  a  big  reduction  in  q  by  choosing  Af-1  to  be  a  good  approximation  to  W-1.  In  other  words,  we 
can  try  to  make  our  new  directions  dk  approximate  the  Newton  direction  for  q .  This  is  the  point  of  most 
iterative  methods,  like  SOR  or  Gauss-Seidel  for  example,  and  it  is  common  to  exploit  such  methods  as 
preconditioners  for  problems  where  they  were  once  used  alone.  In  such  preconditioners,  one  never  needs 
to  work  explicitly  with  M  since  M~iVq(xt)  is  the  iterative  step.  It  i*  worth  noting  that  SOR  does  not 
correspond  to  a  symmetric  positive  definite  preconditioner  M  and  so  SSOR,  which  involves  a  forward  and 
then  backward  sweep,  is  generally  used  because  it  does  correspond  to  a  symmetric  positive  definite  M . 
See  Golub  and  Van  Loan  [1983]. 

This  way  of  choosing  an  iterative  method  as  a  preconditioner  fa-  a  conjugate  direction  algorithm 
lends  itself  to  two  popular  points  of  view:  an  optimizer  would  feel  that  the  iterative  method  is  being  used 
to  accelerate  the  conjugate  direction  algorithm,  but  the  numerical  partial  differential  equations  solver 
would  be  more  likely  to  feel  that  conjugate  directions  is  being  used  to  accelerate  the  basic  iterative  method 
being  used  as  a  preconditioner. 

From  a  purely  matrix  algebra  point  of  view,  this  first  way  of  choosing  M ,  which  we  have  been  dis- 

_  i  _  i 

cussing,  corresponds  to  making  the  condition  number  of  M  7 HM  T  smatl,  since  this  is  the  Hessian  and 

i 

Af-'Vq (xk )  is  the  steepest  descent  direction  for  <7  at  x*  in  the  transformed  variables  x  '=M^x.  It  is  worth 
pointing  out  the  result  that  the  DFP  or  BFGS  applied  to  minimize  q (x)  with  exact  line  searches  and  initial 
Hessian  approximation  M  generates  exactly  the  same  points  in  exact  arithmetic  as  the  conjugate  gradient 
algorithm  preconditioned  by  M . 

The  second  point  of  view  in  choosing  M ,  and  hence  M~lH  =B ,  is  to  try  to  make  K (M-]h,B  ,n-l) 
have  the  smallest  possible  dimension,  say  p  «  n .  The  reason  is  that  in  this  case  the  algorithm  solves  the 
problem  in  p  steps.  This  can  be  seen  from  the  fact  that  Vq(xp)  is  orthogonal  to  K  (M~]h,B  ,p-l).  Thus, 
M^Vqixp)  cannot  lie  in  K(M~]h,B  ,p-l)  unless  it  is  zero.  Then  it  must  be  zero  since 
M~lVq(xp)€  K(M-'h,B  ,p)  and  the  Krylov  subspaces  have  stopped  increasing  their  dimension  so 
K(M~'h,B  ,p)=K{ AT1  h,B  ,p- 1). 
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If  we  ignore  the  influence  of  the  initial  direction  on  the  dimension  of  K {M~lh  ,B  ,/r-l),  then  an 
upper  bound  on  that  dimension  is  the  degree  of  the  minimal  polynomial  of  B .  This  is  easy  to  see  since  if 

m (B )  =  ^a, B '  =  0  is  the  minimal  polynomial  for  B ,  then  0=m(B)M~xh  is  a  nonzero  linear  combination 

of  thep  + 1  generators  for  K (M~lh ,B  ,p )  whose  dimension  must  be  less  than p  + 1.  Since B  is  similar  to  a 
symmetric  matrix,  p  is  the  number  of  distinct  eigenvalues  of  5  See  Hoffman  and  Kunze  [1971]. 

It  is  not  unusual  for  strictly  convex  quadratics  arising  from  discretized  partial  differential  equations 
to  be  solved  with  p  ~rt/ 103.  Such  spectacularly  successful  preconditionings  nearly  always  come  from  deep 
insight  into  the  problem  and  not  from  matrix  theoretic  considerations.  They  often  come  from  discretizing 
and  solving  a  simplified  problem. 

The  choice  of  preconditioners  for  optimization  problems  is  not  nearly  so  well  understood.  This  may 
be  because  most  of  our  effort  has  been  directed  toward  algorithms  and  software  for  general  library  use. 
We  have  generally  used  conjugate  direction  methods  only  for  nonquadratic  problems,  and  then  only  for 
problems  so  large  that  we  have  no  other  choice  (see  Section  6. 1 ). 

For  nonquadratics,  the  conjugate  gradient  formulas  are  generally  given  by 

_  _  T7^/_N.O  _  ff  _  Vf(xk)TVf  (xk) 

PM  - _v/  {xk  y +  pt  Pk ' p*  -  ty(**-,)7' ?/(**->)  (5~3) 

and 

**+i  =x»  +  a* •?*♦!.  a*  minimizes/  along p*+1.  (5.2.4) 

If  f  (x)  is  quadratic,  (5.2.3-4)  are  equivalent  to  (52.1-2),  but  no  matrix  is  required  by  (5.2.3-4).  In  general, 

the  line  search  is  not  done  exactly  but  it  has  to  be  done  fairly  accurately;  see  Gill,  Murray,  and  Wright 

[1981], 

Of  course,  the  conjugate  directions  do  not  remain  conjugate  in  finite  precision  implementations,  and 
there  is  no  reason  they  should  be  conjugate  for  nonquadratics.  The  standard  way  to  handle  this  is  to  save 
some  past  vectors  and  make  sure  dk  is  made  conjugate  with  respect  to  these  few  vectors  (see  Vinsome 
[1976],  Young  and  Jea  [1981]),  or  to  restart  the  method,  perhaps  by  taking  dk  periodically  to  be  a  linear 
combination  of  some  restart  vector  and  the  dk  that  would  have  been  chosen  if  a  restart  were  not  due.  See 
Beale  [1972]  or  Powell  [1977], 


6.  Some  Current  Research  Directions 


In  this  section  we  will  give  brief  summaries  of  some  interesting  areas  of  ongoing  activity.  We  will 
discuss  large  problems,  data  fitting,  secant  updates,  singular  problems,  and  parallel  computation. 

6.1  Large  Problems 

There  are  three  different  scenarios  we  wish  to  consider  here: 

(i)  A  quadratic  model  can  be  formed  and  the  model  Hessian  can  be  factored  if  sparsity  is  taken  into 
account. 

(ii)  A  quadratic  model  can  be  formed  but  linear  iterative  methods  must  be  used  to  solve  the  model  prob¬ 
lem  in  place  of  matrix  factorizations. 

(iii)  The  problem  is  too  large  for  any  explicit  use  of  a  model  Hessian. 

It  is  important  to  note  that  the  number  of  variables  alone  does  not  determine  which  class  fits  a  partic¬ 
ular  problem.  If  a  big  problem  has  a  nice  enough  sparsity  structure  in  the  Hessian  it  fits  in  class  (i);  if  the 
sparsity  structure  is  not  so  nice  it  fits  in  (ii);  and  if  it  isn't  sparse  enough  for  (ii)  it  fits  in  (iii). 

For  problems  in  class  (i),  our  first  choice  would  be  to  use  a  Newton  or  finite  difference  Newton 
model  as  outlined  in  Section  3.5.  If  we  wish  to  use  a  secant  method,  then  we  should  use  a  so-called  "lim¬ 
ited  memory"  method  in  the  spirit  of  the  last  implementation  in  Section  3.3.  Probably,  these  methods  fit 
better  with  a  linesearch  rather  than  a  trust  region.  Buckley  and  Lenir  [1983]  is  a  limited  memory  method 
which  has  been  highly  recommended  to  us  by  users. 

For  problems  of  class  (ii),  there  is  an  elegant  generalization  of  the  dogleg  algorithm  (Steihaug 
[1980],  Toint  [1981])  which  has  shown  its  mettle  in  dealing  with  seismic  inversion.  This  algorithm  can  be 
viewed  as  a  trust  region  implementation  of  the  conjugate  direction  inexact  Newton  method.  The  idea  is 
simple.  Given  a  quadratic  model  and  a  trust  radius,  perform  conjugate  direction  iterations  to  compute  the 
Newton  step  until  either  the  Newton  step  is  computed  inside  the  trust  region  and  taken  as  the  trial  step,  or 
until  some  conjugate  direction  iterate  lands  outside  the  trust  region.  When  the  latter  happens,  the  trial  step 
is  taken  to  be  at  the  intersection  of  the  trust  region  and  the  last  conjugate  direction  step,  or  a  direction  of 


negative  curvature  for  the  quadratic  model  is  generated  at  some  conjugate  direction  iterate  inside  the  trust 
region,  and  the  trial  step  is  taken  where  this  negative  curvature  direction  strikes  the  boundary  of  the  trust 
region.  If  a  preconditioner  is  used,  it  can  be  thought  of  as  defining  the  shape  of  the  trust  region. 

As  computer  storage  has  become  less  expensive,  fewer  problems  must  be  relegated  to  class  (iii). 
Generally,  these  problems  are  solved  using  the  non-matrix  form  of  conjugate  direction  methods  mentioned 
at  the  end  of  Section  5.2. 

Finally,  Griewank  and  Toint  [1982a,b]  have  suggested  and  analyzed  an  interesting  approach  to 
obtaining  Hessian  approximations  for  problems  where  the  objective  function  can  be  written  as  the  sum  of 
objective  functions  that  each  involves  its  own  subset  of  the  variables.  Generally  speaking,  the  summand 
functions  should  have  small  dense  Hessians  which  are  positive  semidefinite  at  the  point  formed  by  select¬ 
ing  the  relevant  subset  of  the  components  of  the  minimizer  of  the  problem.  This  allows  an  approximate 
Hessian  of  the  sum  to  be  assembled  from  (for  example)  BFGS  updates  of  the  summand  Hessians.  The 
reader  familiar  with  research  on  sparse  matrices  will  recognize  the  connection  with  so-called  clique  or 
finite-element  storage.  See  Duff,  Erisman,  and  Reid  [1986]. 

6 2  Data  Fitting 

One  of  the  most  common  sources  of  optimization  problems  is  in  parameter  estimation  arising  from 
fitting  mathematical  models  to  data.  Typically,  data 

(/..*.),  i=l,  •  /n  (6.2.1) 

has  been  collected,  and  one  wants  to  select  the  free  parameters  x  e  IR"  in  the  model  m  (r  jc )  so  that 

mQi.xyzyi,  .  (6.2.2) 

Such  problems  arise  in  almost  all  areas  of  science,  engineering,  and  social  science.  A  more  common  nota¬ 
tion  in  these  applications  is  (x,  ,y,),  i=l,..,n  for  the  data,  and  /  (x  ,9),  06  IR'’  for  the  model  (0  may  be 
replaced  by  3  or  other  symbols),  but  we  will  use  (6.2. 1-2)  to  be  consistent  with  the  remainder  of  this  book. 
Generally,  there  are  far  more  data  points  than  parameters,  that  is  m  »n  in  (6.2.2). 


Usually  it  is  assumed  that  each  t,  is  known  exactly  in  (6.2.1)  but  that  y,  is  measured  with  some  error. 
In  that  case  it  makes  sense  to  achieve  (6.2.2)  by  making  each  residual 

r,(x)  =  m(tl,x)-yi 

as  small  as  possible.  Let 

R(x)  =  (r !<*) . rm(x)f  . 

Then  we  wish  to  choose  x  so  that  the  vector  R  (x )  is  as  small  as  possible,  in  some  sense. 

If  we  choose  x  to 

minimize  II/?  (x)  II  i  or  minimizel!/? (x)ll»  if.  i  •}-, 

then  we  have  a  non-differentiable  unconstrained  optimization  problem.  There  has  been  considerable 
research  into  methods  for  solving  (6.2.3),  see  e.g.  Gill,  Murray,  and  Wright  [1981],  Murray  and  Overton 
[1980,1981],  Bartels  and  Conn  [1981]  and  Conn  [1985].  Research  is  continuing  into  producing  algorithms 
and  software  for  such  problems. 

It  is  much  more  common  to  use  the  1 2  norm  instead  of  (6.2.3),  i.e. 

minimize  /  (x )  =  y  r<  (x  )2  =  jR  (x  )TR  (x )  (6.2.4) 

If  each  yt  =m  (r.  ,x )  +  e,  for  some  true  parameter  value  x  and  some  random  error  e, ,  and  if  the  m  random 
errors  arise  from  independent  and  identical  normal  distribution  with  mean  0,  then  (6.2.3)  produces  the  max¬ 
imum  likelihood  estimator  of  x.  In  any  case,  (6.2.4)  generally  is  easier  to  solve  than  (6.2.3). 

If  R  (x)  is  linear  in  x  then  (6.2.4)  is  a  linear  least  squares  problem  and  it  can  be  solved  in  0 (mn2) 
operations;  see  Stewart  [1973],  Lawson  and  Hanson  [1974],  or  Golub  and  Van  Loan  [1983].  Otherwise  it 
is  a  nonlinear  least  squares  problem.  The  nonlinear  least  squares  problem  is  a  particular  case  of  uncon¬ 
strained  optimization  and  can  be  solved  by  the  techniques  of  the  preceding  sections.  But  many  special  pur¬ 
pose  methods  have  arisen  that  take  advantage  of  the  special  form  of  the  derivatives  of  /  (x ), 

7/(J:)  =  i|I j  ^ri(x)  Tiix)  m  J(x  )T  R  (x ) 
where  J(x)eTRmM  denotes  the  Jacobian  matrix  of/?(x),  and 

V2/(x)  =  Jj  (Vr,  (x  )Vr,  (x  )T  +  V2r,  (x  )r,  (x))«J  (x  )TJ  (x )  +  5  (x )  (6.2.5) 

where  S  (x  )e  IR**"  is  the  part  of  V2/  (x )  that  is  a  function  of  second  derivatives  of  R  (x ). 
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If  both  first  and  second  derivatives  of  R  (x )  are  available  (and  n  is  not  too  large)  then  the  nonlinear 
least  squares  problem  should  probably  just  be  solved  by  standard  unconstrained  optimization  methods. 
Otherwise  the  goal  of  most  special  purpose  nonlinear  least  squares  methods  is  to  produce  efficient  methods 
that  require  only  analytic  or  finite  difference  first  derivatives  of  R(x).  This  supplies  V/ (x)  and  the  first 
part  of  V2/  (x),  but  not  die  second  order  part  S  (x).  Many  strategies  for  using  this  special  structure  exist, 
and  we  summarize  them  very  briefly.  For  more  detail  see  Dennis  and  Schnabel  [1983]. 

Gauss-Newton  and  Levenberg-Marquardt  methods  simply  omit  S(x)  and  base  their  step  on  the 

model 

m  (xc  +d)  =  /  (xe )  +  Vf  &  )Td  +  \drJ  (xc  )TJ  (xc  )d  (6.2.6) 

This  is  very  reasonable  if  the  data  (6.2.1)  is  fit  well  by  the  optimal  model  m  (t,  jc* )  since  in  this  case  R  (x» ) 
and  hence  5  (x. )  will  be  nearly  zero,  and  omitting  S  (x )  will  cause  little  harm.  Methods  that  use  a  line 
search  in  conjunction  with  (6.2.6)  are  generally  called  Gauss-Newton  methods,  while  the  use  of  a  trust 
region  leads  to  More 's  [1978]  derivation  of  the  Levenberg-Marquardt  method.  This  method  is  imple¬ 
mented  in  MINPACK.  (More  ,  Garbow,  and  Hills trom  [1980]  and  has  been  quite  effective  in  practice.  Note 
that  (6.2.6)  is  equivalent  to 

m(xc+d)=  y  Ilf? (xe)+y(xc)d  II ^ 

and  so  these  methods  can  be  derived  from  the  linear  Taylor  series  model  of  R  (x ). 

Alternatively,  secant  methods  for  nonlinear  least  squares  construct  approximations  to  V2/  (x )  given 
by  (6.2.5).  Some  codes  have  used  the  methods  of  Section  3.4  to  approximate  all  of  V2/  (x),  but  more  suc¬ 
cessful  methods  have  arisen  by  approximating  V2/  (xe )  by 

y(Xc)T7(xe)+Ac 

where  Ae  approximates  S  (xc ).  That  is,  the  available  pan  of  the  Hessian  is  used  and  only  the  unavailable 
part  is  approximated.  Dennis,  Gay  and  Walsch  [1981a,b]  constructed  a  very  effective  method  along  these 
lines.  In  general,  such  quasi-Newton  methods  are  more  effective  than  modem  Levenberg-Marquardt 
methods  on  problems  where  R  (x. ),  and  hence  5 (x. )  is  large,  and  of  comparable  effectiveness  otherwise. 

Research  is  continuing  on  various  aspects  of  nonlinear  least  squares  calculations,  including  large 
residual  problems  (Salane  [1987]),  large  dimensional  problems  (Toint  [1987]),  secant  approximation  (Al- 


Baali  and  Fletcher  [1983],  and  application  of  the  tensor  methods  mentioned  in  Section  6.5  (Hanson  and 
Krogh  [1987],  Bouaricha  and  Schnabel  [1987]). 

An  interesting  variant  of  the  data  fitting  problem  occurs  when  there  is  experimental  error  in  both  i, 
and  y, .  In  this  case  it  becomes  appropriate  to  measure  the  distance  between  the  data  point  (r,  ,y, )  and  the 
fitting  function  m  (t  jc)  by  the  (weighted)  Euclidean  distance  from  the  point  to  the  curve.  Boggs,  Byrd,  and 
Schnabel  [1987]  show  that  minimizing  the  sum  of  the  squares  of  these  distances  leads  to  the  problem 

minimize  f  ((m  (i,  +  8,j  )-y,  )2  +  w.^S2)  (6.2.7) 

for  appropriate  weights  w, .  Problem  (6.2.7)  is  a  nonlinear  least  squares  problems  in  m+n  variables,  but 
Schwetlick  and  Tiller  [1985]  and  Boggs,  Byrd,  and  Schnabel  [1987]  show  how  to  solve  it  efficiently  using 
essentially  the  same  work  per  iteration  as  for  the  standard  nonlinear  least  squares  problem  (6.2.4).  Boggs, 
Byrd,  Donaldson,  and  Schnabel  [1987]  provide  a  software  package  for  (6.2.7).  The  case  when  m(:jc)  is 
linear  in  both  t  and  x  is  addressed  in  Golub  and  Van  Loan  [1980]. 

Finally,  there  is  an  increasing  amount  of  cross -fertilization  between  the  optimization  and  statistical 
communities  in  the  study  of  data  fitting  problems.  Areas  of  interest  include  the  application  of  modem 
optimization  techniques  to  specialized  data  fitting  problems  (see  e.g.  Bates  and  Watts  [1987],  Bunch 
[1987],  and  the  statistical  analysis  of  parameter  estimates  obtained  by  nonlinear  least  squares  (see  e.g. 
Bates  and  Watts  [1980],  Cook  and  Witmcr  [1985],  Donaldson  and  Schnabel  [1987], 

6_3  Secant  Updates 

The  investigation  of  alternative  secant  methods  for  unconstrained  optimization  is  enjoying  a  recent 
revival.  This  was  a  very  active  area  in  the  1960’s  and  70's,  starting  with  the  discovery  of  the  first  secant 
update,  the  DFP  update  (3.4.7),  and  continuing  with  the  discovery  of  the  BFGS  update  (3.4.5).  Much 
interest  was  focused  on  the  Broyden  family  of  updates  (Broyden  [1970]) 

fl*(0)  =  0B?"  +  ( l-0)fl  iFCS  (6.3.1) 

which  differ  from  each  other  only  by  multiples  of  a  rank-one  matrix;  this  topic  is  discussed  extensively  in 

Fletcher  [1980],  Several  convergence  results  for  any  choice  of  0e[O,l]  have  been  proven;  see  e.g. 

Griewank  and  Toint  [1982b]  and  Stachurski  [1981].  But  the  consensus  for  over  10  years  has  been  that  the 
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BFGS  is  best  in  practice.  One  hint  is  a  powerful  convergence  result  of  Powell  [1976]  for  the  BFGS  that 
has  never  been  extended  to  the  DFP. 

Some  recent  research  has  attempted  to  explain  theoretically  the  superiority  of  the  BFGS.  Powell 
[1986]  uses  a  simple  example  to  show  that  the  DFP  can  be  very  much  slower  than  the  BFGS.  Byrd, 
Nocedal,  and  Yuan  [1987]  extend  Powell’s  1976  convergence  result  to  any  0€  (0,1],  i.e.  any  convex  combi¬ 
nation  of  the  DFP  and  the  BFGS  except  the  DFP,  and  in  doing  so  show  that  the  DFP  lacks  a  self-con-ecung 
term  in  its  update  formula.  Dennis,  Martinez,  and  Tapia  [1987]  show  that  the  BFGS  has  an  optimal 
bounded  deterioration  property  in  the  convex  class. 

Other  recent  research  has  resumed  computational  and  theoretical  investigation  of  choices  other  than 
the  BFGS  (0  =  0)  in  (6.3.1).  Zhang  and  Tewarson  [1986]  consider  using  0<O;  they  extend  Powell's  con¬ 
vergence  result  to  their  strategy  and  their  computational  results  show  that  it  may  produce  better  results  than 
the  BFGS  in  practice.  Conn,  Gould,  and  Toint  [1981]  revisit  the  symmetric-rank  on  update, 

B+-  Bc  +  —  •  - 

the  choice  0  =  (yJscy{yc-BcSc)Tsc  in  (6.3.1),  which  may  have  a  zero  denominator,  and  show  that  it  may 
be  competitive  with  the  BFGS  when  used  in  conjunction  with  a  trust  region  method  and  safeguarding  of  the 
denominator. 

Dennis  and  Walker  [1985]  and  Vu  [1984]  have  studied  the  problem  of  dealing  with  noise  in  y  for  the 
least  change  Frobenius  norm  updates,  but  the  more  important  analysis  for  the  BFGS  update  is  not  so  well 
understood. 

A  very  important  practical  problem  in  secant  updating  comes  from  constrained  optimization.  We 
want  to  extend  the  BFGS  method  to  maintain  an  approximation  to  the  Hessian  of  the  La  gran  gi  an  with 
respect  to  the  primal  variables.  The  most  commonly  used  method  is  due  to  Powell  [1978],  but  he  shows  in 
Powell  [1985]  that  it  can  lead  to  ill-conditioning  in  the  approximate  Hessians.  Tapia  [1984]  suggests  and 
analyzes  a  promising  procedure. 

Some  very  elegant  work  on  secant  methods  for  nonlinear  problems  that  come  from  discretization  of 
infinite  dimensional  problems  has  been  done  by  Gricwank  [1983]  and  Kelley  and  Sachs  [1986].  They  give 
a  new  derivation  of  the  method  for  nonlinear  two  point  boundary  value  problems  suggested  in  Han  and 
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Soul  [1973].  The  idea  behind  this  work  is  to  consider  the  operator  equation  in  its  natural  function  space 
setting.  Roughly  speaking,  one  defines  a  least-change  secant  method  for  the  operator  equation  using  the 
norm  in  which  Newton’s  method  could  be  shown  to  have  its  familiar  convergence  properties.  This  gives  a 
point-wise  quasi-Newton  method  equivalent  to  a  local  affine  model  which  one  discretizes  and  solves.  In 
other  words,  linearization  precedes  discretization  instead  of  the  standard  approaches  in  which  one  discre¬ 
tizes  the  operator  equation  and  then  iteratively  linearizes  it. 

Finally,  the  advent  of  parallel  computers  is  leading  to  revived  interest  in  multiple  secant  updates. 
These  are  methods  that  attempt  to  satisfy  more  than  one  secant  equation  at  each  iteration  in  order  to  inter¬ 
polate  more  than  one  previous  value  of  g(x)  for  optimization,  or  of  F(x)  for  nonlinear  equations.  They 
were  first  mentioned  by  Barnes  [1965]  for  nonlinear  equations  and  shown  by  Gay  and  Schnabel  [1978]  and 
Schnabel  and  Frank  [1987]  to  lead  to  small  gains  over  Broyden’s  method  for  nonlinear  equations.  The 
application  of  this  approach  to  unconstrained  optimization  has  fundamental  limitations,  see  Schnabel 
[1983].  But  these  methods  now  seem  naturally  suited  to  parallel  computation  where  multiple  values  of 
g  (jc)  or  F(x)  may  be  calculated  at  one  iteration;  see  e.g.  Byrd,  Schnabel,  and  Shultz  [1987], 

6.4  Singular  Problems 

There  are  a  number  of  practical  problems  for  which  J{x- )  (for  nonlinear  equations)  or  V2/  (*. )  (for 
optimization)  are  singular  or  nearly  singular.  We  call  these  singular  problems.  None  of  the  convergence 
results  of  Sections  2-3  apply  to  singular  problems,  and  by  considering  one  variable  problems  we  can  see 
that  slow  convergence  is  to  be  expected  For  example,  applying  Newton’s  method  for  nonlinear  equations 

to  solve  x2  =  0  produces  linear  convergence  with  constant  -j,  while  applying  Newton’s  method  for  opumi- 
zadon  to  minimize  x *  gives  linear  convergence  with  constant  -j .  Dearly  the  standard  linear  and  quadratic 

models  are  less  helpful  in  these  cases  since  all  the  derivatives  used  in  the  standard  models  approach  zero  as 
x  converges  to  x • . 

There  has  been  a  considerable  amount  of  recent  research  analyzing  the  behavior  of  standard  methods 
on  singular  problems,  and  suggesting  improved  methods.  Most  of  this  research  has  been  for  nonlinear 
equations  problems  and  is  summarized  excellently  in  Gncwank  [1985].  We  will  give  a  very  bnef 
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indication  of  this  work. 

Many  researchers  have  analyzed  the  convergence  of  Newton’s  method  for  solving  singular  systems 
of  nonlinear  equations.  For  ’’regular"  singularities,  results  such  as  those  in  Decker  and  Kelley  [1980a,b]  or 
Griewank  [1980]  show  that  one  can  still  expect  linear  convergence  with  the  constant  converging  to  y. 

Decker  and  Kelley  [1985]  show  that  Broyden’s  method,  like  the  one-dimensional  secant  method,  is  linearly 
convergent  on  singular  problems  with  constant  converging  to  C^5-l)/2=0.62,  from  certain  starting  points. 

Various  acceleration  schemes  for  solving  singular  systems  of  equations  have  been  proposed,  and 
they  are  surveyed  in  Griewank  [1985].  Many  authors  have  suggested  methods  related  to  the  one  variable 
idea  of  doubling  the  Newton  step  that  are  intended  solely  for  singular  problems.  One  difficulty  in  applying 
these  techniques  is  that  one  may  not  know  a  priori  whether  the  problem  is  singular.  Some  methods  from 
curve  following  arc  also  applicable  to  singular  systems  of  equations,  (see  e.g.  Moore  and  Spense  [1980]). 
Griewank  [1985]  and  Schnabel  and  Frank  [1984,1987]  have  proposed  methods  that  are  applicable  to  both 
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singular  and  nonsingular  problems.  These  methods  append  a  simple  low  rank  quadratic  term  to  the  stan¬ 
dard  linear  model,  in  a  way  that  doesn't  significantly  increase  the  costs  of  forming,  storing,  or  solving  the 
model.  Schnabel  and  Frank  report  that  their  "tensor"  methods  lead  to  significant  improvements  on  both 
singular  and  nonsingular  test  problems. 

The  solution  of  singular  unconstrained  optimization  problems  is  more  complex.  This  is  related  to  the 
fact  that  for  one  variable  problems,  if  /“  (x. )  =  0,  then  we  must  also  have  /“  (x. )  =  0  and  f"  (x-  )20.  Simi¬ 
larly,  Griewank  and  Osborne  [1983]  show  that  if  V2/  (x.  )v  =0  at  a  mirumizer  x. ,  then  we  must  have 
V3/  (x-  )vw  =  0,  V4/  (x» ) ww  ^0,  and  V3/  (x  )w  not  too  large.  These  conditions  imply  that  the  singularity 
is  "irregular"  and  invalidate  most  of  the  approaches  mentioned  above.  Schnabel  and  Chow  [1987]  intro¬ 


duce  a  "tensor"  method  that  appends  low  rank  third  and  fourth  order  terms  to  the  standard  quadratic  model, 
without  requiring  any  additional  function  or  derivative  evaluations  and  without  appreciably  increasing  the 
cost  of  forming,  storing,  or  solving  the  model.  They  report  that  their  approach  leads  to  a  substantial  reduc¬ 
tion  in  the  cost  of  solving  both  singular  and  nonsingular  test  problems. 
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6.5  Parallel  Computation 


An  important  recent  development  in  computing  is  the  commercial  availability  of  parallel  computers. 
computers  that  can  perform  multiple  operations  concurrently.  These  include  M1MD  (Multiple  Instrucuon 
Multiple  Data)  computers  that  allow  different  instruction  streams  to  be  executed  concurrently,  processor 
arrays  that  apply  the  same  instruction  stream  to  multiple  data  streams  concurrently,  and  vector  computers 
that  use  data  pipelining  to  rapidly  perform  pairwise  addition  or  multiplication  of  vectors.  Since  it  appears 
that  many  of  the  significant  future  gains  in  computer  speed  will  come  from  effectively  utilizing  such 
machines,  it  is  becoming  important  to  design  optimization  methods  that  utilize  them  efficiently  Virtually 
all  the  research  in  this  area  is  quite  recent,  and  we  will  simply  indicate  some  of  the  approaches  that  are 
emerging.  These  are  primarily  oriented  towards  MIMD  computers,  which  seem  to  be  the  class  of  parallel 
computers  best  suited  towards  parallel  optimization  because  they  support  concurrency  at  a  coarse -grain 
algorithmic  level. 

One  approach  towards  parallel  optimization  is  to  design  general  purpose  parallel  variants  of  the 
Newton  or  quasi-Newton  methods  discussed  in  Sections  2-4.  These  types  of  methods  have  have  two 
potentially  significant  costs  that  must  be  considered  in  constructing  parallel  versions.  They  are  the  evalua¬ 
tions  of  funcuons  and  derivatives,  and  the  linear  algebra  costs  in  solving  linear  systems  or  updaung  secant 
approximations.  Both  can  be  adapted  to  parallel  computers. 

The  most  obvious  way  to  use  a  parallel  computer  effectively  during  the  evaluation  of  funcuons  or 
derivatives  is  to  perform  the  multiple  funcuon  evaluations  of  a  finite  difference  gradient  or  Hessian  calcula¬ 
tion  concurrently  (Dixon  [1981],  Patel  [1982],  Lootsma  [1984]).  Schnabel  [1986]  introduces  the  idea  of 
performing  a  speculative  finite  difference  gradient  evaluation  concurrently  with  the  evaluation  of  /  (* )  at  a 
trial  point,  before  it  is  known  whether  this  point  will  be  accepted  as  the  next  iterate.  Since  the  acceptance 
rate  for  tnal  points  usually  is  at  least  70%,  this  gradient  value  will  usually  be  needed,  so  this  simple  stra¬ 
tegy  will  utilize  n+1  or  fewer  processors  quite  efficiently  if  funcuon  evaiuauon  is  the  dominant  cost.  Byrd. 
Schnabel,  and  Shultz  [1987]  invesugate  the  more  difficult  question  of  effectively  utilizing  more  than  n  +  1 

(but  fewer  than  \n2)  processors;  this  leads  to  new  optimization  methods  that  usually  require  significantly 
fewer  iterauons  that  the  BFGS  method.  An  alternative  (see  e.g.  Patel  [1982])  is  to  evaluate  f  (x)  at  many 
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trial  points  simultaneously,  but  this  appears  unlikely  to  produce  as  much  increase  in  speed. 

An  n  become  large  it  is  also  important  to  perform  the  linear  algebra  calculations  in  parallel.  For 
methods  that  use  the  Hessian,  parallel  matrix  factorizations  and  backsolves  are  required  and  many  effective 
algorithms  have  been  produced  (see  e.g.  Heller  [1978],  Geist  and  Heath  [1986]).  For  BFGS  methods,  Han 
[1986]  has  proposed  an  implementation  that  sequences  a  factorization  ZZT  of  the  inverse  of  the  Hessian 
approximation  and  is  well  suited  to  parallel  computers.  An  alternative  is  to  utilize  the  unfactored  inverse 
update  (3.4.6),  which  appears  to  parallelize  as  well  and  require  fewer  arithmetic  operations.  Traditionally 
there  has  been  some  concern  over  the  numerical  stability  of  these  approaches  (see  e.g.  Powell  [1987]);  this 
issue  is  now  being  re-examined  since  they  seem  better  suited  to  parallel  computation  than  the  Cholesy  fac¬ 
torization  update  (3.4.4). 

It  will  also  be  increasingly  important  to  develop  specific  parallel  optimization  methods  for  particular 
classes  of  expensive  optimization  problems.  Some  early  examples  include  Dixon,  Ducksbury,  and  Singh 
[1982],  and  Dixon  and  Spedicato  [1985].  There  has  also  been  work  on  parallel  methods  related  to  other 
optimization  methods  we  have  discussed;  for  example  see  Housos  and  Wing  [1980]  for  a  parallel  conjugate 
direction  method,  and  Lescrenier  [1986]  for  a  parallel  approach  to  partially  separable  optimization. 
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